Text Generation · Transformers · Safetensors · zaya · conversational
BerenMillidge committed
Commit 68b631d · verified · 1 Parent(s): ba27dd5

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -8,14 +8,14 @@ ZAYA1 is an 800m active/8.3B total parameter MoE model, and the first trained en
 
 Our ZAYA1 base model benchmark performance is extremely competitive with the SoTA Qwen3 series of models of comparable scale, and outperforms comparable western open-source models such as SmolLM3, and Phi4. ZAYA1-base excels especially at complex and challenging mathematical and STEM reasoning tasks, nearly matching the performance of SoTA Qwen3 thinking models under high pass@k settings even prior to explicit post-training for reasoning, and exceeds other strong reasoning models such as Phi4-reasoning, and Deepseek-R1-Distill.
 
-Details of our pretraining efforts, hardware specific optimizations, and ZAYA1 base model benchmarks are described in the [accompanying technical report](-/TODO)
+Details of our pretraining efforts, hardware specific optimizations, and ZAYA1 base model benchmarks are described in the [accompanying technical report](-/TODO).
 
 This version of the model has undergone an additional 1T tokens of reasoning-focused midtraining.
 
 
 ## Model Details
 
-ZAYA1's architecture includes several innovations developed at Zyphra. These include
+ZAYA1's architecture includes several innovations developed at Zyphra. These include:
 
 - **Compressed Convolutional Attention (CCA)**: [This novel attention](-/TODO) mechanism performs attention entirely in the latent space enabling significant reductions in parameter count, prefill compute, and KV cache size compared to alternative attention mechanisms, while also being more performant in loss/flop.
 - **ZAYA1 Router**: The ZAYA1 router makes fundamental improvements to the linear router used in almost all existing large-scale MoE models. The ZAYA1 router replaces the linear with a downprojection followed by a depth-mixing EDA layer then a three-layer MLP per expert to add significant nonlinear expressivity to the router.
@@ -24,7 +24,7 @@ ZAYA1's architecture includes several innovations developed at Zyphra. These inc
 
 ![zaya_arch](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/Ih8RnOPNbtRzaVcH16ar-.png)
 
-ZAYA1-reasoning-base uses the [Gemma3](https://ai.google.dev/gemma/terms) tokenizer
+ZAYA1-reasoning-base uses the [Gemma3](https://ai.google.dev/gemma/terms) tokenizer.
 
 ## Performance
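The CCA bullet in the diff above describes attention performed entirely in a latent space. A minimal NumPy sketch of that general idea follows: inputs are compressed once into a smaller latent dimension, Q/K/V and the attention itself live there, so cached K/V vectors are latent-sized rather than model-sized. All names and dimensions here are illustrative assumptions, and the convolutional component of CCA is omitted entirely; this is not the actual ZAYA1 implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, seq = 64, 16, 8  # assumed toy sizes, not ZAYA1's

# Shared down-projection into the latent space; Q/K/V are formed there,
# so the KV cache would hold d_latent-sized vectors instead of d_model-sized.
W_c = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_q = rng.standard_normal((d_latent, d_latent)) / np.sqrt(d_latent)
W_k = rng.standard_normal((d_latent, d_latent)) / np.sqrt(d_latent)
W_v = rng.standard_normal((d_latent, d_latent)) / np.sqrt(d_latent)
W_o = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

def latent_attention(x):  # x: (seq, d_model)
    c = x @ W_c                                # compress once
    q, k, v = c @ W_q, c @ W_k, c @ W_v        # all latent-sized
    scores = q @ k.T / np.sqrt(d_latent)
    scores[np.triu(np.ones((seq, seq), dtype=bool), 1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return (w @ v) @ W_o                       # attention happens in latent space

out = latent_attention(rng.standard_normal((seq, d_model)))
print(out.shape)  # (8, 64)
```

Note how the quadratic score computation and the value aggregation both operate on `d_latent`-wide tensors, which is where the parameter, prefill-compute, and KV-cache savings claimed in the bullet would come from.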
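The router bullet above replaces the usual single linear scoring layer with a downprojection followed by per-expert nonlinear MLPs. A toy NumPy sketch of that idea, under stated assumptions: the depth-mixing EDA layer is omitted, the MLP width, activation (SiLU), and all dimensions are hypothetical, and this is not the actual ZAYA1 router.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_router, n_experts = 64, 16, 4  # assumed toy sizes

# Downprojection shared by all experts.
W_down = rng.standard_normal((d_model, d_router)) / np.sqrt(d_model)

# One small three-layer MLP per expert, each producing a scalar score
# (the EDA depth-mixing layer from the description is omitted here).
mlps = [
    (rng.standard_normal((d_router, d_router)) / np.sqrt(d_router),
     rng.standard_normal((d_router, d_router)) / np.sqrt(d_router),
     rng.standard_normal((d_router, 1)) / np.sqrt(d_router))
    for _ in range(n_experts)
]

def silu(z):
    return z / (1.0 + np.exp(-z))

def route(x):  # x: (tokens, d_model) -> routing probabilities (tokens, n_experts)
    h = x @ W_down  # nonlinear scoring happens in the down-projected space
    scores = np.concatenate(
        [silu(silu(h @ W1) @ W2) @ W3 for (W1, W2, W3) in mlps], axis=-1
    )
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

probs = route(rng.standard_normal((5, d_model)))
print(probs.shape)  # (5, 4)
```

Compared with a plain linear router (`softmax(x @ W)`), the per-expert MLPs can express nonlinear routing decisions, which is the extra expressivity the bullet refers to.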