---
tags:
- mixture-of-experts
- moe
- transformer
- language-model
- pytorch
- conditional-computation
datasets:
- custom
pipeline_tag: text-generation
license: mit
---

# Mixture-of-Experts Language Models

A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).

## Models

This repository contains two MoE architectures:

### 1. Sparse MoE (Top-K Routing)
Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.

### 2. Dynamic MoE (Confidence-Based Routing)
Dynamically adjusts the number of experts per token based on routing confidence—"easy" tokens use fewer experts, "hard" tokens use more.
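
The routing rule itself is not spelled out in this card, so the sketch below is only an illustration of the idea: for each token, keep the smallest set of experts whose cumulative routing probability reaches the confidence threshold τ. Function and argument names are hypothetical.

```python
import torch

def select_experts_dynamic(router_logits: torch.Tensor, tau: float = 0.8):
    """Per token, keep the fewest experts whose cumulative probability reaches tau.

    router_logits: (num_tokens, num_experts). Returns a boolean selection mask and
    the renormalized routing weights. Illustrative sketch, not the repo's code.
    """
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    # Cumulative probability *before* each expert; keep experts until tau is crossed.
    cum_before = sorted_probs.cumsum(dim=-1) - sorted_probs
    keep_sorted = cum_before < tau
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted.float()).bool()
    weights = probs * mask
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return mask, weights
```

With τ=0.8, a token whose top expert already captures at least 0.8 of the routing mass activates a single expert, while a token with a flat routing distribution activates most of them.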

## Model Details

| Parameter | Sparse MoE | Dynamic MoE |
|-----------|------------|-------------|
| Layers | 4 | 4 |
| Hidden Dim | 512 | 512 |
| FFN Dim | 2048 | 2048 |
| Attention Heads | 8 | 8 |
| Experts | 8 | 4 |
| Routing | Top-2 | τ=0.8 threshold |
| Context Length | 256 | 256 |
| Vocab Size | 10,000 | 10,000 |

## Architecture

```
Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
  ├─ RMSNorm → Multi-Head Self-Attention → Residual
  └─ RMSNorm → MoE Layer → Residual

MoE Layer:
  ├─ Router (softmax gating)
  ├─ Expert Selection (Top-K or Dynamic)
  └─ Weighted Expert Outputs
```
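
For concreteness, here is a compact sketch of a sparse top-k MoE layer matching the diagram above (softmax router, top-k selection, weighted sum of expert outputs). The class name, the expert MLP shape, and the slow per-expert loop are illustrative choices, not the repository's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        tokens = x.reshape(-1, x.size(-1))                    # (batch*seq, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)        # softmax gating
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)       # top-k expert selection
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)    # renormalize gate weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):             # readable, not fast
            for slot in range(self.top_k):
                sel = topk_i[:, slot] == e
                if sel.any():
                    out[sel] += topk_p[sel, slot].unsqueeze(-1) * expert(tokens[sel])
        return out.reshape_as(x)
```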

## Training

Both models were trained with the following settings (a PyTorch sketch of this setup follows the list):
- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Learning Rate**: 3e-4 with cosine decay
- **Warmup Steps**: 2,000
- **Weight Decay**: 0.1
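
A minimal sketch of this optimizer and schedule, assuming linear warmup followed by cosine decay to zero; `max_steps` is a placeholder for the actual training budget, which is not stated in this card.

```python
import math
import torch

def build_optimizer(model, max_steps=100_000, warmup_steps=2_000):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
    )

    def lr_lambda(step):
        # Linear warmup, then cosine decay over the remaining steps.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Call `scheduler.step()` once per optimizer step so the warmup is counted in steps, not epochs.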

### Loss Functions

**Sparse MoE:**
```
L = L_CE + α * L_balance
```

**Dynamic MoE:**
```
L = L_CE + β * L_balance + γ * L_entropy
```

Where:
- `L_CE`: Cross-entropy loss
- `L_balance`: Load balancing loss (encourages uniform expert utilization)
- `L_entropy`: Entropy regularization (encourages sparse routing)
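
Neither the auxiliary-loss formulas nor the coefficients are given here, so the sketch below uses one common choice: a Switch-Transformer-style load-balancing term and the mean routing entropy. Coefficient values are placeholders; setting `entropy_coef=0` recovers the Sparse MoE objective.

```python
import torch
import torch.nn.functional as F

def moe_loss(logits, targets, router_probs, expert_mask,
             balance_coef=0.01, entropy_coef=0.001):
    """Illustrative combined loss.

    logits:       (batch, seq, vocab)        next-token logits
    targets:      (batch, seq)               target token ids
    router_probs: (num_tokens, num_experts)  softmax routing probabilities
    expert_mask:  (num_tokens, num_experts)  1 where a token was routed to an expert
    """
    l_ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # Load balancing: fraction of tokens per expert times mean routing
    # probability per expert, scaled by the number of experts.
    num_experts = router_probs.size(-1)
    tokens_per_expert = expert_mask.float().mean(dim=0)
    prob_per_expert = router_probs.mean(dim=0)
    l_balance = num_experts * (tokens_per_expert * prob_per_expert).sum()

    # Entropy regularization: penalizing high entropy pushes the router
    # toward confident, sparse expert choices.
    l_entropy = -(router_probs * router_probs.clamp_min(1e-9).log()).sum(-1).mean()

    return l_ce + balance_coef * l_balance + entropy_coef * l_entropy
```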

## Usage

```python
import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt", map_location="cpu"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt", map_location="cpu"))
```
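
Continuing from the snippet above, a minimal greedy-decoding loop. It assumes the model's forward pass takes a `(batch, seq)` tensor of token ids and returns logits of shape `(batch, seq, vocab)`; that interface and the example token ids are assumptions, not taken from the repository.

```python
sparse_model.eval()
tokens = torch.tensor([[1, 42, 7]])  # placeholder prompt ids from your tokenizer
with torch.no_grad():
    for _ in range(32):
        logits = sparse_model(tokens[:, -256:])  # stay within the 256-token context
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=-1)
print(tokens)
```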

## Files

| File | Description |
|------|-------------|
| `sparse_moe_final.pt` | Sparse MoE model weights |
| `dynamic_moe_final.pt` | Dynamic MoE model weights |
| `sparse_moe_config.json` | Sparse MoE configuration |
| `dynamic_moe_config.json` | Dynamic MoE configuration |

## Citation

```bibtex
@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}
```

## Reference

The dynamic routing model is based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652).

## License

MIT