Update pipeline tag, add project page and paper links (#1)
Update pipeline tag, add project page and paper links (a82489f0d78ddfc422ba3e9b2babc9cf69652886)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md CHANGED
@@ -1,24 +1,25 @@
 ---
+datasets:
+- LibriLight
 language:
 - en
 library_name: transformers
-
+license: apache-2.0
+pipeline_tag: audio-to-audio
 tags:
 - audio
 - speech
 - autoregressive
 - transformers
 - custom_code
-datasets:
-- LibriLight
-license: apache-2.0
 pretty_name: AuriStream1B
 ---
 
-
 # AuriStream-1B
 
-
+[📚 Paper](https://huggingface.co/papers/2508.11598) - [🌐 Project Page](https://tukoresearch.github.io/auristream-speech/)
+
+**AuriStream** is a biologically inspired, GPT-style autoregressive Transformer trained to predict tokens from the speech stream (termed **cochlear tokens**). These cochlear tokens are discrete codes produced by a companion “WavCoch” tokenizer, a model trained to predict the time-frequency cochleagram from a waveform, with an LFQ bottleneck for token read-out. AuriStream uses a long context window (~20 s, ~4096 tokens) and is trained on **LibriLight (~60k hours)** for **500k steps**. It learns meaningful representations of, e.g., phoneme and word identity, and can predict future tokens to generate **speech continuations**. Inputs are cochlear **token IDs**; pair the model with a WavCoch tokenizer to convert audio to tokens.
 
 ---
 
@@ -126,8 +127,6 @@ with torch.no_grad():
     prompt_tokens, rollout_steps, temp=0.7, top_k=50, top_p=0.95, seed=0
 )
 full_tokens = torch.cat([prompt_tokens, pred_tokens], dim=1)  # (1, L+K)
-
-
 ```
 
 ## Architecture overview
@@ -152,5 +151,4 @@ If you use this model, please cite:
   doi = {10.21437/Interspeech.2025-2044},
   issn = {2958-1796}
 }
-```
-
+```
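The generation call shown in the diff exposes the usual sampling knobs (`temp=0.7, top_k=50, top_p=0.95, seed=0`). As a rough sketch of what those parameters mean, not AuriStream's actual decoding code, the helper below (a hypothetical `sample_next_token`) applies temperature scaling, then top-k filtering, then top-p (nucleus) filtering to a vector of logits:

```python
import math
import random

def sample_next_token(logits, temp=0.7, top_k=50, top_p=0.95, rng=None):
    """Illustrative sampler (not AuriStream's implementation): draw one
    token id from raw logits using temperature, top-k, then top-p."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temp for x in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the survivors and draw one index.
    mass = sum(p for _, p in kept)
    r = rng.random() * mass
    for i, p in kept:
        r -= p
        if r <= 0.0:
            return i
    return kept[-1][0]

# With top_k=1 only the argmax survives, so sampling is greedy.
print(sample_next_token([9.0, 0.5, 0.1, 0.2], top_k=1))  # -> 0
```

Lower `temp` concentrates probability on the argmax, `top_k` caps the candidate set size, and `top_p` then trims that set to the smallest prefix covering the requested probability mass, so the two filters compose.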