---
tags:
- mixture-of-experts
- moe
- transformer
- language-model
- pytorch
- conditional-computation
datasets:
- custom
pipeline_tag: text-generation
license: mit
---
# Mixture-of-Experts Language Models
A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).
## Models
This repository contains two MoE architectures:
### 1. Sparse MoE (Top-K Routing)
Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.
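As a rough illustration of top-k gating (not the repository's exact code; `TopKRouter` and its return interface are assumptions made for this sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k gating: pick k experts per token and renormalize their weights."""
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):                          # x: (batch, seq, d_model)
        logits = self.gate(x)                      # (batch, seq, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        return topk_probs, topk_idx                # per-token expert weights and indices
```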
### 2. Dynamic MoE (Confidence-Based Routing)
Dynamically adjusts the number of experts per token based on routing confidence—"easy" tokens use fewer experts, "hard" tokens use more.
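A minimal sketch of confidence-based routing under the same assumptions: experts are added in order of gate probability until the cumulative probability reaches the threshold τ, so the number of active experts varies per token (`ConfidenceRouter` is an illustrative name, not the repository's class):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceRouter(nn.Module):
    """Illustrative threshold routing: keep adding experts (highest gate probability first)
    until their cumulative probability reaches tau."""
    def __init__(self, d_model: int, num_experts: int, tau: float = 0.8):
        super().__init__()
        self.tau = tau
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):                                    # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)              # (tokens, num_experts)
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        cum = sorted_probs.cumsum(dim=-1)
        keep = torch.ones_like(sorted_probs)                 # the top-1 expert is always kept
        keep[:, 1:] = (cum[:, :-1] < self.tau).float()       # add more only while still below tau
        mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep)
        weights = probs * mask                                # zero out unselected experts
        return weights / weights.sum(dim=-1, keepdim=True)    # renormalized dense gate, mostly zeros
```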
## Model Details
| Parameter | Sparse MoE | Dynamic MoE |
|-----------|------------|-------------|
| Layers | 4 | 4 |
| Hidden Dim | 512 | 512 |
| FFN Dim | 2048 | 2048 |
| Attention Heads | 8 | 8 |
| Experts | 8 | 4 |
| Routing | Top-2 | τ=0.8 threshold |
| Context Length | 256 | 256 |
| Vocab Size | 10,000 | 10,000 |
## Architecture
```
Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output
Transformer Block:
└─ RMSNorm → Multi-Head Self-Attention → Residual
└─ RMSNorm → MoE Layer → Residual
MoE Layer:
└─ Router (softmax gating)
└─ Expert Selection (Top-K or Dynamic)
└─ Weighted Expert Outputs
```
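The MoE layer at the bottom of the diagram can be sketched as below, reusing the `TopKRouter` interface from the earlier sketch. This dense formulation runs every expert on every token for clarity; an efficient implementation would dispatch only the routed tokens to each expert.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Illustrative MoE layer: route each token, then sum the weighted outputs of its selected experts."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, router: nn.Module):
        super().__init__()
        self.router = router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (batch, seq, d_model)
        weights, idx = self.router(x)                    # (batch, seq, k) each
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = (idx == e)                          # positions where expert e was selected
            if chosen.any():
                w = (weights * chosen).sum(dim=-1, keepdim=True)  # gate weight for expert e, 0 elsewhere
                out = out + w * expert(x)                # dense for clarity; real kernels skip unrouted tokens
        return out
```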
## Training
Both models were trained with the following settings (a minimal optimizer and schedule sketch follows the list):
- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Learning Rate**: 3e-4 with cosine decay
- **Warmup Steps**: 2,000
- **Weight Decay**: 0.1
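A sketch of this setup in PyTorch, where `model` stands for either MoE model and `total_steps` is a hypothetical training length:

```python
import math
import torch

total_steps = 100_000    # hypothetical total number of optimizer steps
warmup_steps = 2_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Call scheduler.step() after every optimizer.step().
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```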
### Loss Functions
**Sparse MoE:**
```
L = L_CE + α * L_balance
```
**Dynamic MoE:**
```
L = L_CE + β * L_balance + γ * L_entropy
```
Where:
- `L_CE`: Cross-entropy loss
- `L_balance`: Load balancing loss (encourages uniform expert utilization)
- `L_entropy`: Entropy regularization (encourages sparse routing)
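Putting these pieces together, here is a sketch of the Dynamic MoE objective. The balance term follows the common Switch-Transformer-style formulation, and the β/γ values are illustrative, not necessarily what this repository uses:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_probs, expert_mask):
    """Switch-style balance term.
    router_probs: (tokens, num_experts) softmax outputs; expert_mask: (tokens, num_experts) 0/1 assignments."""
    num_experts = router_probs.size(-1)
    fraction_routed = expert_mask.float().mean(dim=0)   # share of routing assignments per expert
    mean_prob = router_probs.mean(dim=0)                # average gate probability per expert
    return num_experts * torch.sum(fraction_routed * mean_prob)

def routing_entropy(router_probs):
    """Mean per-token entropy of the routing distribution; penalizing it encourages sparse routing."""
    return -(router_probs * torch.log(router_probs + 1e-9)).sum(dim=-1).mean()

def dynamic_moe_loss(logits, targets, router_probs, expert_mask, beta=0.01, gamma=0.001):
    """L = L_CE + beta * L_balance + gamma * L_entropy (beta/gamma values are illustrative)."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return ce + beta * load_balance_loss(router_probs, expert_mask) + gamma * routing_entropy(router_probs)
```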
## Usage
```python
import torch
from moe.moelm import MoeLM, DynamicMOELM
# Load Sparse MoE
sparse_model = MoeLM(
vocab_size=10000,
num_layers=4,
context_length=256,
d_model=512,
d_ff=2048,
num_heads=8,
num_experts=8,
top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt"))
# Load Dynamic MoE
dynamic_model = DynamicMOELM(
vocab_size=10000,
num_layers=4,
context_length=256,
d_model=512,
d_ff=2048,
num_heads=8,
num_experts=4,
confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt"))
```
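A possible greedy decoding loop on top of the loaded model, assuming the forward pass takes a `(batch, seq)` tensor of token IDs and returns logits of shape `(batch, seq, vocab_size)`; check the repository for the exact signature:

```python
import torch

sparse_model.eval()
input_ids = torch.tensor([[1, 42, 7]])        # example token IDs from your tokenizer
with torch.no_grad():
    for _ in range(32):
        logits = sparse_model(input_ids)                              # (1, seq, vocab_size), assumed
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)       # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)
```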
## Files
| File | Description |
|------|-------------|
| `sparse_moe_final.pt` | Sparse MoE model weights |
| `dynamic_moe_final.pt` | Dynamic MoE model weights |
| `sparse_moe_config.json` | Sparse MoE configuration |
| `dynamic_moe_config.json` | Dynamic MoE configuration |
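If the config JSON keys mirror the constructor arguments shown in the Usage section (an assumption; adjust if they differ), a model can be rebuilt directly from its config file:

```python
import json
import torch
from moe.moelm import MoeLM

# Assumes sparse_moe_config.json stores the same keyword arguments used above.
with open("sparse_moe_config.json") as f:
    config = json.load(f)

model = MoeLM(**config)
model.load_state_dict(torch.load("sparse_moe_final.pt", map_location="cpu"))
model.eval()
```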
## Citation
```bibtex
@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}
```
## Reference
The dynamic routing approach is based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652).
## License
MIT