Module 'torchaudio' has no attribute 'AudioMetaData'

Hi, I was writing a script for a diarization + transcription of audio files and I came across an error

Traceback (most recent call last):
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 69, in <module>
    from pyannote.audio import Inference, Model, Pipeline
  File "/home/user/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/pyannote/audio/__init__.py", line 29, in <module>
    from .core.inference import Inference
  File "/home/user/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/pyannote/audio/core/inference.py", line 35, in <module>
    from pyannote.audio.core.io import AudioFile
  File "/home/user/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/pyannote/audio/core/io.py", line 60, in <module>
    ) -> torchaudio.AudioMetaData:
         ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torchaudio' has no attribute 'AudioMetaData'

It fails relatively early in the script, in the import block, so I am betting it is some sort of compatibility issue. Another reason I suspect this: the script is built on a previous iteration that wasn’t accurate but also required torchaudio, and that one didn’t give me any issues. The only other thing before the import block is the dependencies block:

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "faster-whisper",
#   "nvidia-cublas-cu12",
#   "nvidia-cudnn-cu12",
#   "numpy",
#   "pyannote.audio>=3.1",
#   "nvidia-cublas",
#   "nvidia-cudnn-cu13",
#   "nvidia-npp",
#   "scikit-learn",
#   "torch",
#   "torchaudio",
#   "torchcodec",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

How can I solve this?

This looks like a compatibility issue. If you just want to work around it, pinning the library versions should suffice, but if you want to use the latest version of the library, you may need to make some modifications to the script:


Fixing AttributeError: module 'torchaudio' has no attribute 'AudioMetaData' in a pyannote + faster-whisper + uv script

Short diagnosis

This is very likely a dependency compatibility issue, not a bug in your diarization or transcription logic.

The important clue is that the crash happens during this import:

from pyannote.audio import Inference, Model, Pipeline

The script is not reaching the real diarization/transcription part yet. Python is still importing pyannote.audio, and the crash happens inside pyannote’s audio I/O module:

) -> torchaudio.AudioMetaData:
      ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torchaudio' has no attribute 'AudioMetaData'

That means pyannote code is referencing:

torchaudio.AudioMetaData

but the installed torchaudio package no longer exposes that object.

This fits the current TorchAudio transition: AudioMetaData and related audio I/O APIs were deprecated in TorchAudio 2.8 and removed in TorchAudio 2.9 as part of TorchAudio’s move into maintenance mode and the shift of media decoding/encoding functionality toward TorchCodec.


The likely version story is:

Your script asks uv for broad/latest-ish package versions
        ↓
uv resolves a newer TorchAudio, probably 2.9+
        ↓
the installed pyannote.audio code still references torchaudio.AudioMetaData
        ↓
import pyannote.audio fails before your own script logic runs

Why your current dependency block is fragile

Your current inline metadata has this shape:

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "faster-whisper",
#   "nvidia-cublas-cu12",
#   "nvidia-cudnn-cu12",
#   "numpy",
#   "pyannote.audio>=3.1",
#   "nvidia-cublas",
#   "nvidia-cudnn-cu13",
#   "nvidia-npp",
#   "scikit-learn",
#   "torch",
#   "torchaudio",
#   "torchcodec",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

The main risk is here:

"pyannote.audio>=3.1",
"torch",
"torchaudio",
"torchcodec",

Those constraints are too broad for a fast-moving audio/ML stack.

They allow uv to pick a package family like:

pyannote.audio 3.x
torch 2.9.x
torchaudio 2.9.x
torchcodec 0.8.x or 0.9.x

That is exactly the kind of combination that can fail: pyannote 3.x-era code may still reference older TorchAudio APIs, while TorchAudio 2.9 removed APIs deprecated in 2.8.

This is not really uv’s fault. uv is resolving from the constraints you gave it. The problem is that the constraints are too loose for a stack where torch, torchaudio, torchcodec, CUDA libraries, FFmpeg, pyannote, and faster-whisper all interact.



Recommended fix: recover first with a pinned compatible stack

I would not start by rewriting the whole diarization/transcription pipeline. First, recover the existing script by pinning a compatible version family.

Use this dependency block first:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "faster-whisper",
#   "numpy",
#   "pyannote.audio==3.4.0",
#   "scikit-learn",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

Why these pins?

  • pyannote.audio==3.4.0: keeps you on the pyannote 3.x generation, which is likely closer to your current script. The pyannote 3.4.0 release was a maintenance release that pinned the related pyannote.{core,database,metrics,pipeline} dependencies to avoid breakage in the 3.x branch. See the pyannote 3.4.0 release note.
  • torch==2.8.0: keeps PyTorch in the last generation before the TorchAudio 2.9 removal boundary.
  • torchaudio==2.8.0: keeps torchaudio.AudioMetaData available. TorchAudio 2.8 still has it, though deprecated. See the TorchAudio 2.8 docs.
  • torchcodec==0.7.*: TorchCodec’s own compatibility table maps TorchCodec 0.7 to Torch 2.8. See the TorchCodec README.
  • requires-python = ">=3.10,<3.14": your traceback shows Python 3.12. TorchCodec 0.7 supports Python >=3.9, <=3.13, so Python 3.12 is a reasonable target.

The practical point is simple:

TorchAudio 2.9+ removed AudioMetaData
        ↓
pyannote import crashes
        ↓
pin TorchAudio to 2.8.0
        ↓
AudioMetaData exists again
        ↓
pyannote can import

Remove the manually listed NVIDIA packages for the first recovery attempt

I would remove these from the first recovery attempt:

"nvidia-cublas-cu12",
"nvidia-cudnn-cu12",
"nvidia-cublas",
"nvidia-cudnn-cu13",
"nvidia-npp",

Reasons:

  1. They are not the cause of the current error.
  2. The current error is a Python import-time attribute lookup, not a CUDA runtime error.
  3. The block mixes CUDA 12 and CUDA 13 package names.
  4. Manually mixing NVIDIA runtime packages can make the environment harder to reason about.
  5. PyTorch CUDA wheel selection should be handled coherently through the PyTorch wheel/index strategy, not by mixing low-level NVIDIA packages casually.

This does not mean CUDA never matters. It does mean CUDA should be debugged after pyannote imports.

For faster-whisper GPU execution, you may later need CUDA/cuDNN-related fixes. The faster-whisper project documents CUDA and cuDNN expectations in its README.

But that is a second-stage issue. First fix:

from pyannote.audio import Inference, Model, Pipeline

Also fix the shebang and quotes

Use:

#!/usr/bin/env -S uv run --script

instead of:

#!/usr/bin/env -S uv run

The uv docs use uv run --script for scripts with inline metadata.

Also make sure your actual file uses straight quotes, not curly quotes.

Bad if literally present in the file:

“torch”

Good:

"torch"

If the curly quotes only appeared because of formatting while pasting into a forum, ignore this. If they are actually in the file, the inline metadata is not valid TOML.
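If you want to verify the file itself, a small scan for curly quotes settles it quickly. A minimal sketch; the `sample` string below stands in for your script’s real contents:

```python
# Scan text for curly quotes that would make the inline metadata
# block invalid TOML. The sample stands in for the real file contents.
CURLY = "\u201c\u201d\u2018\u2019"  # “ ” ‘ ’

def find_curly_quotes(text):
    """Return (line_number, line) pairs containing curly quotes."""
    return [
        (i, line)
        for i, line in enumerate(text.splitlines(), start=1)
        if any(ch in line for ch in CURLY)
    ]

sample = '#   "torch",\n#   \u201ctorchaudio\u201d,\n'
for lineno, line in find_curly_quotes(sample):
    print(f"line {lineno}: {line}")
```

An empty result means the quoting is clean and the problem is elsewhere.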


Step-by-step recovery procedure

Step 1: Replace the dependency block

Use this exact header first:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "faster-whisper",
#   "numpy",
#   "pyannote.audio==3.4.0",
#   "scikit-learn",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

Step 2: Test pyannote import in isolation

Before testing the full diarization/transcription script, create a small file called check_pyannote_import.py:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "pyannote.audio==3.4.0",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
# ]
# ///

import sys
from importlib.metadata import version

import torch
import torchaudio
from pyannote.audio import Pipeline

print("python:", sys.version)
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("torchcodec:", version("torchcodec"))
print("AudioMetaData exists:", hasattr(torchaudio, "AudioMetaData"))
print("pyannote import OK")

Run it with a fresh resolution:

uv run --refresh --script check_pyannote_import.py

Expected output should include something like:

torch: 2.8.0...
torchaudio: 2.8.0...
torchcodec: 0.7...
AudioMetaData exists: True
pyannote import OK

The most important line is:

AudioMetaData exists: True

If that line is False, you are still not running with the TorchAudio version you think you are running.


Step 3: Inspect the resolved dependency tree

Run:

uv tree --script diaritranscribe3.py

Look for:

pyannote.audio==3.4.0
torch==2.8.0
torchaudio==2.8.0
torchcodec==0.7.x

For this recovery path, you do not want:

torchaudio==2.9.x
torch==2.9.x

TorchAudio and PyTorch should be matched. Do not use a mixed pair like:

torch 2.8 + torchaudio 2.9

or:

torch 2.9 + torchaudio 2.8

The safer recovery pair is:

torch 2.8.0 + torchaudio 2.8.0



Step 4: Run your real script with refresh

After the minimal import test works:

uv run --refresh --script diaritranscribe3.py

If the script is executable:

chmod +x diaritranscribe3.py
./diaritranscribe3.py

Step 5: Lock the script after it works

Once the import works and the script begins running normally, lock the dependency set:

uv lock --script diaritranscribe3.py

uv supports lockfiles for PEP 723 inline scripts. The lockfile is created next to the script, for example:

diaritranscribe3.py.lock


This is important because your current error is exactly the kind of failure that lockfiles prevent. Without a lockfile, the same script can work today and break later when a newer torchaudio, torchcodec, pyannote-core, pyannote-metrics, or other dependency becomes resolvable.


Why I recommend recovery before full migration

There are two possible paths:

  • Recovery path: keep your current pyannote 3.x-style script and pin compatible versions. Best first move when the script fails at import time and you want minimal code changes.
  • Migration path: move to current pyannote 4.x / community-1 / TorchCodec / FFmpeg assumptions. Better long-term, but may require code changes and may expose new TorchCodec/FFmpeg issues.

For your case, I would choose recovery first.

Reason: the traceback proves the import environment is broken. It does not prove that your diarization logic, faster-whisper logic, VAD logic, or speaker-label alignment logic is wrong.

The disciplined order is:

fix pyannote import
        ↓
test pyannote alone
        ↓
test faster-whisper alone
        ↓
test speaker/transcript alignment
        ↓
then consider migrating to newer pyannote conventions

What the forward-migration path would look like later

The current pyannote direction is more TorchCodec-centered. The current pyannote repository describes pyannote.audio as a PyTorch-based speaker diarization toolkit, and current pyannote usage increasingly assumes TorchCodec and FFmpeg for audio decoding.


A newer pyannote-style snippet can look like this:

import torch
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="<HUGGINGFACE_ACCESS_TOKEN>",
)

pipeline.to(torch.device("cuda"))

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)

for turn, speaker in output.speaker_diarization:
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")


That newer path may be the right long-term direction, but it is a migration, not just a one-line dependency fix. It may change:

  • model name;
  • access/token handling;
  • audio decoding assumptions;
  • FFmpeg requirements;
  • TorchCodec version requirements;
  • output object shape;
  • how you iterate diarization results;
  • how you align diarization segments with transcript segments.

Older pyannote 3.x code commonly looks more like this:

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

So moving from pyannote 3.x to pyannote 4.x may require real script edits. That is why I would first recover your current script.


What to test after the import works

After this import stops crashing:

from pyannote.audio import Inference, Model, Pipeline

test each subsystem separately.

1. Test PyTorch and CUDA

import torch

print("torch:", torch.__version__)
print("torch cuda build:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))

If this prints:

cuda available: False

then the pyannote import may be fixed, but your PyTorch build is CPU-only or CUDA-incompatible. That is a separate problem.


2. Test pyannote alone on a small WAV

For the first test, avoid MP3/M4A/WEBM. Normalize to a small mono 16 kHz WAV:

ffmpeg -y -i input.mp3 -ac 1 -ar 16000 test.wav

Then test only diarization:

from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="<HUGGINGFACE_ACCESS_TOKEN>",
)

if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

diarization = pipeline("test.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(turn.start, turn.end, speaker)




3. Test faster-whisper alone

from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("test.wav", beam_size=5)

print("language:", info.language, info.language_probability)

for segment in segments:
    print(segment.start, segment.end, segment.text)

Important faster-whisper detail: segments is a generator, so transcription starts when you iterate over it or convert it to a list. The faster-whisper README documents this behavior.
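The lazy behavior can be illustrated with a plain generator; this is a stand-in, not faster-whisper itself:

```python
# Stand-in generator illustrating faster-whisper's lazy segments:
# no work happens until you iterate or call list().
def fake_transcribe():
    print("transcription starts now")
    yield (0.0, 2.5, "hello")
    yield (2.5, 5.0, "world")

segments = fake_transcribe()   # no transcription has run yet
print("got generator")
results = list(segments)       # iteration triggers the work
print(len(results), "segments")
```

The practical consequence: if your script builds `segments` but never iterates it before doing other work, the transcription cost is paid later than you might expect.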


If faster-whisper fails with CUDA/cuDNN/cuBLAS errors, that is a different layer from the pyannote import failure.


4. Then combine diarization and transcription

Once both pyannote and faster-whisper work independently, then debug the speaker-attributed transcript logic.

The next hard problem is usually timestamp reconciliation:

diarization:
SPEAKER_00 from 10.0s to 14.8s

transcription:
"yeah that makes sense" from 13.9s to 16.2s

You need a policy for assigning transcript text to speakers.

Common policies:

  • Midpoint assignment: assign a transcript segment to the speaker active at the segment midpoint. Simple, but weak for long segments that cross speaker changes.
  • Maximum overlap: assign the transcript segment to the speaker with the largest time overlap. Usually a good first implementation.
  • Split at speaker boundaries: split transcript segments when diarization changes speakers. More accurate, more code.
  • Word-level assignment: use word timestamps and assign each word separately. Best when word timestamps are reliable.
  • Exclusive diarization: prefer non-overlapping diarization when available. Easier to reconcile with transcript timestamps.

For a first robust script, I would use maximum overlap at the segment level. Later, if transcript quality matters a lot, move to word-level assignment.
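Segment-level maximum overlap fits in a few lines. A hedged sketch; the tuple shapes are assumptions (diarization turns as (start, end, speaker), transcript segments as (start, end, text)), so adapt them to your actual pyannote and faster-whisper objects:

```python
# Assign each transcript segment to the diarization speaker with the
# largest time overlap; fall back to a placeholder label if nothing
# overlaps at all.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(turns, segments, fallback="UNKNOWN"):
    """Label each transcript segment with the most-overlapping speaker."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = fallback, 0.0
        for turn_start, turn_end, speaker in turns:
            ov = overlap(seg_start, seg_end, turn_start, turn_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled

# The reconciliation example from above: the segment overlaps
# SPEAKER_00 by 0.9 s but the (assumed) next speaker by 1.4 s.
turns = [(10.0, 14.8, "SPEAKER_00"), (14.8, 20.0, "SPEAKER_01")]
segments = [(13.9, 16.2, "yeah that makes sense")]
print(assign_speakers(turns, segments))
# → [(13.9, 16.2, 'SPEAKER_01', 'yeah that makes sense')]
```

The quadratic loop is fine for typical recordings; sort both lists and sweep if you ever process very long audio.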


Why not just monkey-patch pyannote?

Some workarounds patch the installed pyannote file and replace something like:

) -> torchaudio.AudioMetaData:

with:

) -> object:

or a quoted annotation.

That can work temporarily because the failing reference is often annotation-related. But I would not keep that as the real solution.

Reasons:

  • it modifies files inside site-packages;
  • uv can rebuild the environment and erase the patch;
  • it hides the real version mismatch;
  • another removed TorchAudio API may fail later;
  • it makes the environment non-reproducible;
  • it is harder to explain or maintain.

For this script, version pinning is cleaner.


Likely next errors after this fix

After you fix AudioMetaData, you may hit another layer. That is normal.

Possible next error: Hugging Face model access

If you load pyannote models from Hugging Face, you may need to accept model conditions and use a token.

Possible symptoms:

401 Unauthorized
403 Forbidden
Repository not found
You are not in the authorized list


Possible next error: TorchCodec / FFmpeg

If you migrate toward newer pyannote or use TorchCodec-backed decoding, you may see errors like:

Could not load libtorchcodec
FFmpeg is not properly installed


Possible next error: CUDA unavailable

If:

torch.cuda.is_available()

returns:

False

then the import problem is fixed, but your PyTorch install is not seeing the GPU. That is a PyTorch wheel/index/CUDA issue.


Possible next error: faster-whisper CUDA/cuDNN/CTranslate2

Possible symptoms:

Library libcudnn_ops_infer.so not found
CUDA failed
unsupported compute type



Final recommended script header

For your current script, I would start with this:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "faster-whisper",
#   "numpy",
#   "pyannote.audio==3.4.0",
#   "scikit-learn",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

Then run:

uv run --refresh --script diaritranscribe3.py

Then inspect:

uv tree --script diaritranscribe3.py

Then lock:

uv lock --script diaritranscribe3.py

Bottom line

The fix is not to install a random missing package. The fix is to choose a coherent version family.

For your current script, the most practical recovery family is:

pyannote.audio 3.4.0
torch 2.8.0
torchaudio 2.8.0
torchcodec 0.7.x
Python 3.12

That combination is aimed at the exact failure:

AttributeError: module 'torchaudio' has no attribute 'AudioMetaData'

After that works, debug CUDA, faster-whisper, model access, FFmpeg/TorchCodec, and speaker-transcript alignment as separate layers.

Short summary

  • The error is almost certainly a dependency compatibility issue.
  • torchaudio.AudioMetaData existed in TorchAudio 2.8 but was removed in TorchAudio 2.9.
  • Your dependency block lets uv resolve incompatible versions.
  • Pin pyannote.audio==3.4.0, torch==2.8.0, torchaudio==2.8.0, and torchcodec==0.7.*.
  • Remove the mixed manual nvidia-* packages for the first recovery attempt.
  • Test pyannote import by itself.
  • Use uv run --refresh --script ....
  • Inspect with uv tree --script ....
  • Lock with uv lock --script ....
  • Consider pyannote 4.x / community-1 later as a real migration, not the first fix.

I’ve implemented up to step 4, but then I got a couple of new errors.
The first one was straightforward, since the error message explained it:

  × No solution found when resolving script dependencies:
  ╰─▶ Because only brouhaha==0.9.0 is available and brouhaha==0.9.0 depends on pyannote-audio==3.3.0,
      we can conclude that all versions of brouhaha depend on pyannote-audio==3.3.0.
      And because you require pyannote-audio==3.4.0 and brouhaha, we can conclude that your
      requirements are unsatisfiable.

But once I fixed that (pyannote-audio==3.4.0 -> pyannote-audio==3.3.0 in dependencies), I got a new error:

/home/rodrigo/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/pyannote/audio/core/io.py:212: UserWarning: torchaudio._backend.list_audio_backends has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release. 
  torchaudio.list_audio_backends()
Loading diarization pipeline pyannote/speaker-diarization-community-1...
Traceback (most recent call last):
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 621, in <module>
    main()
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 589, in main
    diarization = diarize_audio(
                  ^^^^^^^^^^^^^^
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 208, in diarize_audio
    pipeline = Pipeline.from_pretrained(MODEL_ID, token=tokens["diarization"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'token'

That seems to stick around even when I revert the changes I’ve made to diaritranscribe3.py

That new issue is likely a compatibility problem on the pyannote side.
I don’t have much personal experience with pyannote myself, but I have used it while investigating migration issues. It is a very version-sensitive library, and its usage tends to change significantly with each release.

This isn’t limited to pyannote: when updating libraries that sit this close to the backend, it’s best to proceed on the assumption that you’ll need to rewrite the model configuration and related execution code in your scripts:


New errors after pinning pyannote/TorchAudio: causes and fixes

Short version

You made progress.

The original problem was:

AttributeError: module 'torchaudio' has no attribute 'AudioMetaData'

That was the TorchAudio 2.9+ compatibility problem. Pinning back to the Torch 2.8 / TorchAudio 2.8 generation gets you past that layer.

Now you have a different problem:

TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'token'

This is not the same error. This one is a pyannote API mismatch.

Your dependency resolver says:

brouhaha==0.9.0 depends on pyannote-audio==3.3.0

So your environment is now effectively pinned to:

pyannote.audio==3.3.0

But your code is calling pyannote like this:

pipeline = Pipeline.from_pretrained(MODEL_ID, token=tokens["diarization"])

and it is loading:

pyannote/speaker-diarization-community-1

That is the newer pyannote 4.x / Community-1 style. It does not match the pyannote.audio==3.3.0 API that brouhaha forces.

The immediate fix is:

MODEL_ID = "pyannote/speaker-diarization-3.1"

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    use_auth_token=tokens["diarization"],
)

Do not use token= with pyannote.audio==3.3.0.

Do not use speaker-diarization-community-1 while you are on the brouhaha / pyannote 3.3 recovery path.



What caused the first new error?

You got this resolver error:

× No solution found when resolving script dependencies:
╰─▶ Because only brouhaha==0.9.0 is available and brouhaha==0.9.0 depends on pyannote-audio==3.3.0,
    we can conclude that all versions of brouhaha depend on pyannote-audio==3.3.0.
    And because you require pyannote-audio==3.4.0 and brouhaha, we can conclude that your
    requirements are unsatisfiable.

This means uv is doing the correct thing.

You asked for:

pyannote-audio==3.4.0

but your local brouhaha package requires:

pyannote-audio==3.3.0

Those two cannot both be true.

So changing:

pyannote-audio==3.4.0

to:

pyannote-audio==3.3.0

was a reasonable fix.

But that change has an important consequence:

You are now on the pyannote 3.3 API.

That means the rest of the code must also use the pyannote 3.3 call style.


What caused the second new error?

You then got:

Loading diarization pipeline pyannote/speaker-diarization-community-1...
Traceback (most recent call last):
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 621, in <module>
    main()
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 589, in main
    diarization = diarize_audio(
                  ^^^^^^^^^^^^^^
  File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 208, in diarize_audio
    pipeline = Pipeline.from_pretrained(MODEL_ID, token=tokens["diarization"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'token'

The key line is:

Pipeline.from_pretrained(MODEL_ID, token=tokens["diarization"])

The token= keyword is the newer call style. It appears in current Community-1 examples.

But pyannote.audio==3.3.0 expects the older keyword:

use_auth_token=

So this:

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    token=tokens["diarization"],
)

should become this:

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    use_auth_token=tokens["diarization"],
)

That is the direct fix for the unexpected keyword argument 'token' error.
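If you expect to flip between the two pyannote generations, one hedged option is to pick the keyword at runtime by inspecting the callable’s signature. A sketch with stand-in functions; in the real script `fn` would be `Pipeline.from_pretrained`, and note the check cannot discriminate if the real method accepts `**kwargs`:

```python
# Choose the auth keyword based on the callable's signature, so the
# same code runs against pyannote 3.x (use_auth_token=) and the newer
# token= style.
import inspect

def call_with_token(fn, model_id, hf_token):
    params = inspect.signature(fn).parameters
    if "token" in params:
        return fn(model_id, token=hf_token)
    return fn(model_id, use_auth_token=hf_token)

def new_api(model_id, token=None):           # stand-in for the 4.x style
    return ("new", model_id, token)

def old_api(model_id, use_auth_token=None):  # stand-in for the 3.x style
    return ("old", model_id, use_auth_token)

print(call_with_token(new_api, "model", "hf_xxx"))  # ('new', 'model', 'hf_xxx')
print(call_with_token(old_api, "model", "hf_xxx"))  # ('old', 'model', 'hf_xxx')
```

For the recovery path, though, hard-coding use_auth_token= is simpler and matches the pinned pyannote 3.3.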


The model ID is probably wrong for this recovery path too

Your log says:

Loading diarization pipeline pyannote/speaker-diarization-community-1...

That is another mismatch.

For pyannote.audio==3.3.0, use:

MODEL_ID = "pyannote/speaker-diarization-3.1"

not:

MODEL_ID = "pyannote/speaker-diarization-community-1"

The speaker-diarization-community-1 pipeline belongs to the newer pyannote 4.x era. It is documented with token=..., output.speaker_diarization, and output.exclusive_speaker_diarization.

The pyannote 3.3 path is different. It uses speaker-diarization-3.1, use_auth_token=..., and the returned object is usually iterated with:

for turn, _, speaker in diarization.itertracks(yield_label=True):
    ...



The TorchAudio warning is expected

This warning:

/home/rodrigo/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/pyannote/audio/core/io.py:212: UserWarning: torchaudio._backend.list_audio_backends has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
  torchaudio.list_audio_backends()

is not the current crash.

It means:

pyannote.audio 3.3.0 is calling an old TorchAudio API.
TorchAudio 2.8 still has that API, but warns that it will disappear in 2.9.

That warning is exactly why you should not upgrade TorchAudio to 2.9 in this recovery path.

Keep:

torch==2.8.0
torchaudio==2.8.0

TorchAudio 2.8 warns. TorchAudio 2.9 removes. For old pyannote code, a warning is better than a missing attribute crash.



Recommended current fix

Use the pyannote 3.3-compatible dependency set

Given your brouhaha constraint, use this dependency block:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "faster-whisper",
#   "numpy",
#   "pyannote.audio==3.3.0",
#   "scikit-learn",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

Why:

  • pyannote.audio==3.3.0: required by your local brouhaha==0.9.0 package.
  • torch==2.8.0: coherent with TorchAudio 2.8 and TorchCodec 0.7.
  • torchaudio==2.8.0: keeps deprecated APIs available instead of removed.
  • torchcodec==0.7.*: TorchCodec’s compatibility table maps 0.7 to Torch 2.8.
  • faster-whisper: keep it for transcription, but debug it separately from pyannote.
  • No manual nvidia-* packages: avoid mixing CUDA generations while fixing the pyannote import and model loading.



Recommended code patch

Find your current code around line 208:

pipeline = Pipeline.from_pretrained(MODEL_ID, token=tokens["diarization"])

Change it to:

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    use_auth_token=tokens["diarization"],
)

Also change the model ID.

If you currently have:

MODEL_ID = "pyannote/speaker-diarization-community-1"

change it to:

MODEL_ID = "pyannote/speaker-diarization-3.1"

A compact pyannote 3.3-compatible function would look like:

from pyannote.audio import Pipeline
import torch

MODEL_ID = "pyannote/speaker-diarization-3.1"

def diarize_audio(audio_path, tokens):
    print(f"Loading diarization pipeline {MODEL_ID}...")

    pipeline = Pipeline.from_pretrained(
        MODEL_ID,
        use_auth_token=tokens["diarization"],
    )

    if torch.cuda.is_available():
        pipeline.to(torch.device("cuda"))

    diarization = pipeline(audio_path)

    return diarization

Then, when reading the result:

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f} {turn.end:.2f} {speaker}")

This matches the pyannote 3.x style.


Why it still happens after “reverting” the script

There are a few likely reasons.

1. You changed the environment, not just the file

Even if you revert part of diaritranscribe3.py, your dependency environment still contains:

pyannote.audio==3.3.0

because brouhaha requires it.

So token= will keep failing until the code matches pyannote 3.3.

Check the actual runtime version:

from importlib.metadata import version

print("pyannote.audio:", version("pyannote.audio"))

Expected now:

pyannote.audio: 3.3.0

If that is the version, use:

use_auth_token=

not:

token=

2. Your MODEL_ID may still point to Community-1

Search your script:

grep -n "speaker-diarization" diaritranscribe3.py

For the recovery path, it should show:

pyannote/speaker-diarization-3.1

not:

pyannote/speaker-diarization-community-1

3. Your script may still contain token=

Search:

grep -n "token=" diaritranscribe3.py

For the pyannote call, change:

token=tokens["diarization"]

to:

use_auth_token=tokens["diarization"]

Do not necessarily change every token= in the whole script. Other libraries may still use a token keyword. The specific problem is the pyannote 3.3 call to Pipeline.from_pretrained.
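If you prefer to patch this from the shell, a substitution anchored on the full call keeps unrelated token= usages intact. A hedged one-liner, demonstrated on a scratch file so nothing real is touched (GNU sed syntax; point it at diaritranscribe3.py, ideally with a backup, for the real edit):

```shell
# Demo on a scratch copy; the real target is diaritranscribe3.py.
printf 'pipeline = Pipeline.from_pretrained(MODEL_ID, token=tokens["diarization"])\n' > /tmp/demo_call.py
sed -i 's/from_pretrained(MODEL_ID, token=/from_pretrained(MODEL_ID, use_auth_token=/' /tmp/demo_call.py
cat /tmp/demo_call.py
```

Anchoring on from_pretrained(MODEL_ID, rather than bare token= is what keeps the edit from touching other libraries’ token keywords.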


4. uv may be reusing a cached script environment

Use refresh while testing:

uv run --refresh --script diaritranscribe3.py

Then inspect the dependency tree:

uv tree --script diaritranscribe3.py

You want to see something close to:

pyannote.audio==3.3.0
torch==2.8.0
torchaudio==2.8.0
torchcodec==0.7.x

Once it works, lock it:

uv lock --script diaritranscribe3.py



Two coherent paths from here

Path A — recommended now: stay with brouhaha and pyannote 3.3

Choose this if your priority is to get the current script working.

Use:

pyannote.audio==3.3.0
torch==2.8.0
torchaudio==2.8.0
torchcodec==0.7.*

Use model:

MODEL_ID = "pyannote/speaker-diarization-3.1"

Use auth keyword:

use_auth_token=tokens["diarization"]

Use output iteration:

for turn, _, speaker in diarization.itertracks(yield_label=True):
    ...

This is the low-risk recovery path because it respects the brouhaha dependency pin.


Path B — later migration: use Community-1 and pyannote 4.x

Choose this if you want the newer pyannote stack and are willing to deal with migration work.

You would need to remove or modify the brouhaha constraint first. Options:

  1. Remove brouhaha.
  2. Replace brouhaha with another VAD path.
  3. Fork/edit your local brouhaha package so it does not require pyannote-audio==3.3.0.
  4. Update brouhaha, if a newer compatible version exists in your local project.
  5. Split the environment so brouhaha and modern pyannote are not forced into the same dependency graph.

Then you can move toward:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=tokens["diarization"],
)

and newer output handling:

output = pipeline(audio_path)

for turn, speaker in output.speaker_diarization:
    print(turn.start, turn.end, speaker)

# If available and useful for transcript alignment:
for turn, speaker in output.exclusive_speaker_diarization:
    print(turn.start, turn.end, speaker)

But treat this as a real migration. It may involve:

  • TorchCodec;
  • FFmpeg;
  • newer pyannote output objects;
  • new model access requirements;
  • possibly higher VRAM use;
  • different diarization output behavior;
  • changes to transcript/speaker alignment code.


Immediate diagnostic checklist

Run these in order.

1. Confirm versions

Add this temporarily near the top of the script:

from importlib.metadata import version
import torch
import torchaudio

print("pyannote.audio:", version("pyannote.audio"))
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("torchcodec:", version("torchcodec"))
print("AudioMetaData exists:", hasattr(torchaudio, "AudioMetaData"))

Expected for the recovery path:

pyannote.audio: 3.3.0
torch: 2.8.0...
torchaudio: 2.8.0...
torchcodec: 0.7...
AudioMetaData exists: True

If torchaudio is 2.9.x, you are back in the danger zone.


2. Confirm model ID

For Path A, use:

MODEL_ID = "pyannote/speaker-diarization-3.1"

not:

MODEL_ID = "pyannote/speaker-diarization-community-1"

3. Confirm auth keyword

For Path A, use:

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    use_auth_token=tokens["diarization"],
)

not:

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    token=tokens["diarization"],
)

4. Confirm access to gated models

For speaker-diarization-3.1, make sure the Hugging Face account behind your token has accepted the relevant model conditions.

Common symptoms of missing access are different from your current error. They look more like:

401 Unauthorized
403 Forbidden
Repository not found
Could not download pipeline


5. Refresh uv while testing

uv run --refresh --script diaritranscribe3.py

Then inspect:

uv tree --script diaritranscribe3.py

Then lock after success:

uv lock --script diaritranscribe3.py

What not to do right now

Do not upgrade TorchAudio to silence the warning.

This warning:

torchaudio._backend.list_audio_backends has been deprecated

does not mean:

upgrade torchaudio

In this case it means:

you are using legacy pyannote code that still works on TorchAudio 2.8, but will break on TorchAudio 2.9

So for the recovery path, keep:

torchaudio==2.8.0

Do not switch back to:

token=tokens["diarization"]

unless you migrate to a pyannote version that supports it.

Do not use:

pyannote/speaker-diarization-community-1

unless you deliberately move to the newer pyannote 4.x path.

Do not reintroduce mixed CUDA packages while debugging this pyannote problem. CUDA can be debugged after pyannote loads.


Final recommended state for your current script

Use this dependency block:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "faster-whisper",
#   "numpy",
#   "pyannote.audio==3.3.0",
#   "scikit-learn",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
#   "omegaconf",
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
# ]
# ///

Use this model ID:

MODEL_ID = "pyannote/speaker-diarization-3.1"

Use this load call:

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    use_auth_token=tokens["diarization"],
)

Use this output style:

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f} {turn.end:.2f} {speaker}")

Then run:

uv run --refresh --script diaritranscribe3.py

Inspect:

uv tree --script diaritranscribe3.py

Lock:

uv lock --script diaritranscribe3.py

Bottom line

You now have two separate compatibility boundaries:

  1. TorchAudio boundary
    Old pyannote code needs TorchAudio 2.8, not 2.9+.
    The deprecation warning is expected. Keep torchaudio==2.8.0.

  2. pyannote API boundary
    brouhaha forces pyannote.audio==3.3.0.
    That means use speaker-diarization-3.1 and use_auth_token=..., not speaker-diarization-community-1 and token=....

So the practical fix is:

pyannote.audio==3.3.0
torch==2.8.0
torchaudio==2.8.0
torchcodec==0.7.*
MODEL_ID = "pyannote/speaker-diarization-3.1"
Pipeline.from_pretrained(..., use_auth_token=...)

Treat community-1 + token=... as a later migration path after you remove, update, or fork the brouhaha dependency that pins pyannote to 3.3.0.

Thank you so much for helping me so far! I love the detailed, step-by-step explanations and I hope they can help other people with similar problems.

I’ve implemented

MODEL_ID = "pyannote/speaker-diarization-3.1"
pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    use_auth_token=tokens["diarization"],
)

but now I am running into an issue of

File "/home/user/diarization/repos/scripts/diaritranscribe3.py", line 208, in diarize_audio
    pipeline = Pipeline.from_pretrained(MODEL_ID, use_auth_token=tokens["diarization"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/pyannote/audio/core/pipeline.py", line 89, in from_pretrained
    config_yml = hf_hub_download(
                 ^^^^^^^^^^^^^^^^
  File "/home/user/.cache/uv/environments-v2/diaritranscribe3-3f9949c47f20e532/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 88, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
TypeError: hf_hub_download() got an unexpected keyword argument 'use_auth_token'

where use_auth_token= is not recognized as a valid keyword argument.

Looking ahead, updating the library is really the best course of action, but given your current setup, the migration process is quite complicated:


Path B — later migration: use Community-1 and pyannote.audio 4.x

Short version

Path B means intentionally leaving the old pyannote.audio==3.3.0 recovery stack and moving to the newer pyannote stack:

pyannote.audio 4.x
pyannote/speaker-diarization-community-1
Pipeline.from_pretrained(..., token=...)
output.speaker_diarization
output.exclusive_speaker_diarization
TorchCodec-backed audio decoding
FFmpeg installed

This is not just a one-line model change.

It is a real migration because your current brouhaha dependency pins:

pyannote-audio==3.3.0

while the newer Community-1 examples expect the newer pyannote API surface:

Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="<HUGGINGFACE_ACCESS_TOKEN>",
)

The current pyannote README shows this community-1 + token=... style and says FFmpeg must be installed because TorchCodec handles audio decoding.


Why you should not do Path B casually

Your current stack has two separate constraints:

brouhaha==0.9.0
        ↓
requires pyannote-audio==3.3.0

and:

Community-1 / pyannote 4.x examples
        ↓
use token=...
use output.speaker_diarization
use output.exclusive_speaker_diarization
expect TorchCodec/FFmpeg audio decoding

Those are different worlds.

The pyannote 3.3 recovery world uses:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="<HUGGINGFACE_ACCESS_TOKEN>",
)

diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    ...

The pyannote 4 / Community-1 world uses:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="<HUGGINGFACE_ACCESS_TOKEN>",
)

output = pipeline("audio.wav")

for turn, speaker in output.speaker_diarization:
    ...

And, when available, the newer path also gives:

output.exclusive_speaker_diarization

That exclusive_speaker_diarization output is especially relevant for your transcription project because the Community-1 model card describes it as simplifying reconciliation between diarization timestamps and transcription timestamps.


What Path B is for

Choose Path B if you want one or more of these:

  • newer pyannote.audio API;
  • the open-source pyannote/speaker-diarization-community-1 pipeline;
  • better diarization quality than the old speaker-diarization-3.1 baseline;
  • easier reconciliation with transcripts using exclusive_speaker_diarization;
  • a forward-looking stack instead of living on TorchAudio 2.8 deprecation warnings;
  • a cleaner long-term project layout.

Do not choose Path B if your immediate goal is only:

make the old script run with the least changes

For the least-change recovery path, stay with:

pyannote.audio==3.3.0
pyannote/speaker-diarization-3.1
use_auth_token=...
torch==2.8.0
torchaudio==2.8.0
torchcodec==0.7.*

Path B is the better long-term migration, but the worse emergency fix.


The main blocker: brouhaha

The problem

Your resolver already told you:

brouhaha==0.9.0 depends on pyannote-audio==3.3.0

So this cannot work:

"pyannote.audio>=4,<5",
"brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",

unless you change something about brouhaha.

The resolver is correct. If brouhaha requires exactly:

pyannote-audio==3.3.0

then the environment cannot also contain:

pyannote.audio>=4

Your options

You have five realistic choices.

Option 1: Remove brouhaha
  What it means: delete it from dependencies and remove/replace its VAD calls.
  Good if: you do not strictly need Brouhaha VAD.
  Risk: you may lose the current VAD behavior.

Option 2: Replace brouhaha
  What it means: use pyannote’s own diarization behavior, faster-whisper VAD, Silero VAD, or another VAD stage.
  Good if: you only used Brouhaha as a helper.
  Risk: may change segmentation and final transcript quality.

Option 3: Fork/edit brouhaha
  What it means: change its dependency metadata from pyannote-audio==3.3.0 to a looser or newer version.
  Good if: you control the local package and can test it.
  Risk: its code may actually depend on pyannote 3.3 internals.

Option 4: Split environments
  What it means: run Brouhaha preprocessing in one script/env, then run pyannote 4 diarization in another script/env.
  Good if: you need Brouhaha but also want Community-1.
  Risk: more moving parts and file handoff.

Option 5: Stay on Path A
  What it means: do not migrate now; keep pyannote 3.3.
  Good if: you want stability first.
  Risk: you do not get Community-1 yet.

My recommendation: do not start by editing brouhaha dependency metadata blindly.

First inspect why it pins pyannote:

grep -R "pyannote" -n /home/user/diarization/repos/.venv/brouhaha-vad

Look for files like:

pyproject.toml
setup.py
setup.cfg
requirements.txt

Then inspect imports:

grep -R "from pyannote\|import pyannote" -n /home/user/diarization/repos/.venv/brouhaha-vad

If Brouhaha only uses public, stable APIs, loosening the pin might work. If it uses pyannote internals or pyannote 3.x-specific output structures, expect breakage.


Recommended migration strategy

Do not migrate the production script all at once.

Use a three-stage migration.

Stage 1: build a tiny Community-1 proof-of-life script
Stage 2: port only diarization code
Stage 3: reintegrate transcription, VAD, and speaker-label alignment

This prevents one common failure mode:

changed model + changed pyannote version + changed TorchCodec + changed FFmpeg + changed CUDA + changed VAD + changed transcript alignment
        ↓
too many variables
        ↓
impossible to tell what broke

Stage 1 — prove Community-1 works by itself

Create a new test file, separate from diaritranscribe3.py.

For example:

check_pyannote4_community1.py

Use this as a minimal proof-of-life script:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "pyannote.audio>=4,<5",
#   "torch",
#   "torchaudio",
#   "torchcodec",
# ]
# ///

import os
from importlib.metadata import version

import torch
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

MODEL_ID = "pyannote/speaker-diarization-community-1"
AUDIO_PATH = "audio.wav"

token = os.environ.get("HF_TOKEN")
if not token:
    raise RuntimeError("Set HF_TOKEN before running this script.")

print("pyannote.audio:", version("pyannote.audio"))
print("torch:", torch.__version__)
print("torch cuda build:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("torchaudio:", version("torchaudio"))
print("torchcodec:", version("torchcodec"))

pipeline = Pipeline.from_pretrained(
    MODEL_ID,
    token=token,
)

if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

with ProgressHook() as hook:
    output = pipeline(AUDIO_PATH, hook=hook)

print("\nRegular diarization:")
for turn, speaker in output.speaker_diarization:
    print(f"{turn.start:.3f}\t{turn.end:.3f}\t{speaker}")

print("\nExclusive diarization:")
if hasattr(output, "exclusive_speaker_diarization"):
    for turn, speaker in output.exclusive_speaker_diarization:
        print(f"{turn.start:.3f}\t{turn.end:.3f}\t{speaker}")
else:
    print("exclusive_speaker_diarization is not available on this output.")

Run it like:

export HF_TOKEN="<HUGGINGFACE_ACCESS_TOKEN>"
uv run --refresh --script check_pyannote4_community1.py

Before running it, make sure:

  1. you accepted the Community-1 user conditions;
  2. your token can access the model;
  3. FFmpeg is installed;
  4. the test file audio.wav exists.


Stage 2 — choose a coherent Torch/TorchCodec version family

The current pyannote project metadata says the modern branch requires:

Python >=3.10
torch >=2.8.0
torchaudio >=2.8.0
torchcodec >=0.7.0

But “greater than or equal” does not mean every arbitrary combination is equally good.

TorchCodec publishes a compatibility table. Current table highlights include:

torchcodec 0.7  ↔ torch 2.8
torchcodec 0.8  ↔ torch 2.9
torchcodec 0.9  ↔ torch 2.9
torchcodec 0.10 ↔ torch 2.10
torchcodec 0.11 ↔ torch 2.11

So do not mix randomly.
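To make the rule concrete, a quick sanity check against that table might look like this. The mapping below is just a snapshot of the table above, not something TorchCodec exposes; check the TorchCodec README for the current values:

```python
# Snapshot of the TorchCodec <-> Torch compatibility table above
COMPATIBLE = {"0.7": "2.8", "0.8": "2.9", "0.9": "2.9", "0.10": "2.10", "0.11": "2.11"}


def torch_family_matches(torchcodec_version, torch_version):
    # Compare major.minor prefixes only, e.g. "0.7.2" -> "0.7"
    tc = ".".join(torchcodec_version.split(".")[:2])
    t = ".".join(torch_version.split(".")[:2])
    return COMPATIBLE.get(tc) == t
```

In a live environment you would feed it `version("torchcodec")` and `version("torch")` from `importlib.metadata`.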

Conservative modern family

This is the least aggressive Community-1 migration target:

pyannote.audio>=4,<5
torch==2.8.0
torchaudio==2.8.0
torchcodec==0.7.*

Pros:

  • close to the minimum modern pyannote requirements;
  • avoids jumping all the way to newer Torch/TorchAudio generations;
  • TorchCodec 0.7 matches Torch 2.8;
  • likely easier if the rest of your audio stack was stabilized around Torch 2.8.

Cons:

  • still close to the old TorchAudio transition boundary;
  • may not represent the newest pyannote-tested stack.

Newer Torch family

A newer family might look like:

pyannote.audio>=4,<5
torch==2.9.*
torchaudio==2.9.*
torchcodec==0.9.*

or:

pyannote.audio>=4,<5
torch==2.10.*
torchaudio==2.10.*
torchcodec==0.10.*

Pros:

  • more aligned with the post-TorchAudio-2.9 world;
  • better long-term direction if your other dependencies support it.

Cons:

  • may expose TorchCodec/FFmpeg issues;
  • may conflict with faster-whisper/CTranslate2 expectations;
  • may require more careful PyTorch CUDA wheel/index selection.

Practical advice

For a migration branch, start with the conservative modern family:

"pyannote.audio>=4,<5",
"torch==2.8.0",
"torchaudio==2.8.0",
"torchcodec==0.7.*",

Then, after Community-1 works, decide whether to move Torch upward.

Do not solve every modernization problem at once.


Stage 3 — remove or isolate brouhaha

Because brouhaha pins pyannote 3.3, your Community-1 test script should not include Brouhaha.

For Path B, the dependency block should start without it:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "pyannote.audio>=4,<5",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
# ]
# ///

Only after Community-1 works should you decide what to do with Brouhaha.

If you remove Brouhaha

Delete:

"brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",

and remove code like:

import brouhaha

or any function calls into Brouhaha.

Then rely on pyannote diarization directly, or use another VAD/preprocessing layer.

If you fork Brouhaha

Edit its dependency metadata.

For example, if its pyproject.toml contains:

dependencies = [
    "pyannote-audio==3.3.0",
]

you could test:

dependencies = [
    "pyannote-audio>=4,<5",
]

or, if Brouhaha does not actually need pyannote at runtime after your refactor:

dependencies = []

But do this only in a branch or copy.

Then run its own tests, or at least import it:

uv run --refresh --script check_brouhaha_import.py

where:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "brouhaha @ file:///home/user/diarization/repos/.venv/brouhaha-vad",
#   "pyannote.audio>=4,<5",
# ]
# ///

import brouhaha
from importlib.metadata import version

print("brouhaha import OK")
print("pyannote.audio:", version("pyannote.audio"))

If this fails, Brouhaha is not pyannote-4-compatible yet.

If you split environments

Use two scripts.

First script:

vad_preprocess.py

uses Brouhaha and pyannote 3.3 if needed.

Second script:

diarize_community1.py

uses pyannote 4 and Community-1.

The handoff should be a file, JSON, RTTM, or plain timestamp list. This is clunkier, but it avoids forcing incompatible libraries into one dependency graph.
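As a sketch of that handoff, assuming a JSON file of (start, end, speaker) turns as the interchange format (the function names here are just illustrative):

```python
import json


def save_turns(turns, path):
    # Write (start, end, speaker) tuples from the VAD/pyannote 3.3 script
    with open(path, "w") as f:
        json.dump([{"start": s, "end": e, "speaker": spk} for s, e, spk in turns], f)


def load_turns(path):
    # Read the handoff back in the pyannote 4 / Community-1 script
    with open(path) as f:
        return [(t["start"], t["end"], t["speaker"]) for t in json.load(f)]
```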


Stage 4 — update the pyannote call

Old Path A code:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=tokens["diarization"],
)

diarization = pipeline(audio_path)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    ...

New Path B code:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=tokens["diarization"],
)

output = pipeline(audio_path)

for turn, speaker in output.speaker_diarization:
    ...

And, for transcript alignment, prefer testing:

for turn, speaker in output.exclusive_speaker_diarization:
    ...

The current Community-1 model card says exclusive_speaker_diarization is provided on top of regular diarization and is meant to simplify reconciliation with transcription timestamps.


Stage 5 — rewrite speaker/transcript alignment around exclusive diarization

This is the most important practical benefit for your script.

Your final goal is not just diarization. Your goal is:

audio file
        ↓
transcript segments or words
        ↓
speaker labels
        ↓
speaker-attributed transcript

Old diarization can produce fine-grained, overlapping, or awkward speaker turns. That can be hard to align to Whisper/faster-whisper transcript segments.

Community-1 adds:

output.exclusive_speaker_diarization

Use that first for transcript alignment.

Basic maximum-overlap assignment

Use this when your ASR gives segment-level timestamps.

def overlap_seconds(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker_to_segment(segment_start, segment_end, diarization_turns):
    best_speaker = None
    best_overlap = 0.0

    for turn_start, turn_end, speaker in diarization_turns:
        overlap = overlap_seconds(segment_start, segment_end, turn_start, turn_end)
        if overlap > best_overlap:
            best_overlap = overlap
            best_speaker = speaker

    return best_speaker or "UNKNOWN"


def diarization_to_turns(exclusive_speaker_diarization):
    turns = []
    for turn, speaker in exclusive_speaker_diarization:
        turns.append((float(turn.start), float(turn.end), str(speaker)))
    return turns

Then:

turns = diarization_to_turns(output.exclusive_speaker_diarization)

for segment in whisper_segments:
    speaker = assign_speaker_to_segment(segment.start, segment.end, turns)
    print(f"[{segment.start:.2f}-{segment.end:.2f}] {speaker}: {segment.text}")

Word-level assignment

If faster-whisper returns word timestamps, word-level assignment is usually better.

Conceptually:

for each word:
    find the speaker turn with max overlap
    assign that speaker to the word
then merge adjacent words with the same speaker

This handles speaker changes inside a long ASR segment better than assigning one speaker to the whole segment.
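That conceptual loop can be sketched like this, assuming words arrive as (start, end, text) tuples and diarization turns as the (start, end, speaker) tuples built earlier:

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    # Same helper as in the segment-level example
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker_to_word(word_start, word_end, turns):
    # Pick the diarization turn with maximum overlap with this word
    best_speaker, best_overlap = "UNKNOWN", 0.0
    for turn_start, turn_end, speaker in turns:
        overlap = overlap_seconds(word_start, word_end, turn_start, turn_end)
        if overlap > best_overlap:
            best_overlap, best_speaker = overlap, speaker
    return best_speaker


def merge_words_by_speaker(words, turns):
    # words: (start, end, text) tuples; returns (speaker, text) utterances
    utterances = []
    for start, end, text in words:
        speaker = assign_speaker_to_word(start, end, turns)
        if utterances and utterances[-1][0] == speaker:
            utterances[-1] = (speaker, utterances[-1][1] + " " + text)
        else:
            utterances.append((speaker, text))
    return utterances
```

With faster-whisper you would build the `words` list from the word timestamps on each transcribed segment.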


Stage 6 — verify FFmpeg and TorchCodec

Community-1 uses TorchCodec-backed decoding. The pyannote README explicitly says FFmpeg must be installed because TorchCodec handles audio decoding.

Check FFmpeg:

ffmpeg -version

Check TorchCodec import:

import torchcodec
print("torchcodec import OK")

Check versions:

from importlib.metadata import version
import torch

print("torch:", torch.__version__)
print("torchcodec:", version("torchcodec"))

TorchCodec supports FFmpeg major versions in [4, 8], and on Windows it needs FFmpeg builds with separate shared libraries. The TorchCodec README also provides the TorchCodec/Torch/Python compatibility table.

If TorchCodec fails

Common error shapes:

RuntimeError: Could not load libtorchcodec
FFmpeg is not properly installed
No compatible FFmpeg found

Likely causes:

  • FFmpeg missing;
  • FFmpeg installed but not visible on PATH;
  • Windows FFmpeg build is not a shared build;
  • TorchCodec version does not match Torch version;
  • Python version is outside the wheel’s supported range;
  • unsupported architecture, especially Linux ARM64/aarch64.

Check the compatibility table before changing random packages.


Stage 7 — choose uv layout: inline script vs project

You can do Path B with inline script metadata, but a project layout is cleaner once you are juggling:

pyannote.audio
torch
torchaudio
torchcodec
faster-whisper
ctranslate2
ffmpeg
CUDA
tokens
local packages

Inline script version

Good for quick experiments:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "pyannote.audio>=4,<5",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
# ]
# ///

from pyannote.audio import Pipeline

Lock after success:

uv lock --script check_pyannote4_community1.py

Project version

Better for the real app.

pyproject.toml:

[project]
name = "diaritranscribe"
version = "0.1.0"
requires-python = ">=3.10,<3.14"
dependencies = [
  "pyannote.audio>=4,<5",
  "faster-whisper",
  "numpy",
  "scikit-learn",
  "omegaconf",
  "torch==2.8.0",
  "torchaudio==2.8.0",
  "torchcodec==0.7.*",
]

[tool.uv]
required-version = ">=0.5.3"

Then:

uv lock
uv sync
uv run python scripts/diaritranscribe4.py

If you need explicit CUDA PyTorch indexes, use uv’s PyTorch guide.

PyTorch packaging is unusual because CPU and CUDA builds may live on different indexes and use local version specifiers such as +cpu or +cu130.
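As an illustration only (the index name and CUDA tag below are placeholders; take the exact values for your driver from uv’s PyTorch guide), pinning the torch wheels to a dedicated index looks roughly like this in pyproject.toml:

```toml
# Illustrative: a dedicated index for CUDA torch wheels
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

# Route only the torch-family packages through that index
[tool.uv.sources]
torch = { index = "pytorch-cu128" }
torchaudio = { index = "pytorch-cu128" }
```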


Stage 8 — update token handling

Use environment variables rather than hardcoding tokens.

export HF_TOKEN="<HUGGINGFACE_ACCESS_TOKEN>"

Python:

import os

token = os.environ.get("HF_TOKEN")
if not token:
    raise RuntimeError("Set HF_TOKEN.")

Then:

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=token,
)

Make sure the token’s Hugging Face account has accepted the model conditions.

Missing access usually gives errors like:

401 Unauthorized
403 Forbidden
Repository not found
gated repo

Those are different from the old unexpected keyword argument 'token' error.


Stage 9 — account for telemetry

Current pyannote docs mention optional telemetry. The README says it tracks privacy-preserving information such as pipeline origin, pipeline class, file duration, and speaker-count parameters, and documents ways to control it.

Disable for the current process if desired:

export PYANNOTE_METRICS_ENABLED=0

Or in Python:

from pyannote.audio.telemetry import set_telemetry_metrics

set_telemetry_metrics(False)


Stage 10 — test accuracy and runtime before deleting Path A

Do not delete the working pyannote 3.3 path until you compare:

  • same audio file;
  • same hardware;
  • same preprocessing;
  • same transcript segments;
  • same speaker-label assignment policy;
  • same output format.

Compare:

speaker count
number of turns
total diarization time
overlap behavior
transcript speaker-label quality
GPU memory use
runtime
failure rate on long files

A migration is successful only if the final speaker-attributed transcript improves or remains acceptable.
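For the turn-count side of that comparison, a small helper can summarize each path’s output once both have been converted to (start, end, speaker) tuples:

```python
def diarization_stats(turns):
    # Summarize one diarization result for side-by-side comparison
    speakers = {speaker for _, _, speaker in turns}
    total_speech = sum(end - start for start, end, _ in turns)
    return {
        "speakers": len(speakers),
        "turns": len(turns),
        "total_speech_seconds": round(total_speech, 3),
    }
```

Run it on the pyannote 3.3 output and the Community-1 output for the same file and compare the numbers before trusting the migration.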


Suggested branch layout

Keep two scripts for a while:

diaritranscribe3.py       # recovery path, pyannote 3.3
diaritranscribe4.py       # migration path, pyannote 4 / Community-1

Keep two lockfiles if using inline scripts:

diaritranscribe3.py.lock
diaritranscribe4.py.lock

This prevents accidentally breaking the known-good path while testing the new one.


Minimal diaritranscribe4.py starting point

This is a clean starting point for just the diarization part.

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "pyannote.audio>=4,<5",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
# ]
# ///

import argparse
import os
from importlib.metadata import version

import torch
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

MODEL_ID = "pyannote/speaker-diarization-community-1"


def print_versions():
    print("pyannote.audio:", version("pyannote.audio"))
    print("torch:", torch.__version__)
    print("torch cuda build:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available())
    print("torchaudio:", version("torchaudio"))
    print("torchcodec:", version("torchcodec"))


def load_pipeline(token: str):
    pipeline = Pipeline.from_pretrained(
        MODEL_ID,
        token=token,
    )

    if torch.cuda.is_available():
        pipeline.to(torch.device("cuda"))

    return pipeline


def run_diarization(audio_path: str):
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set HF_TOKEN before running this script.")

    print_versions()
    print(f"Loading {MODEL_ID}...")

    pipeline = load_pipeline(token)

    with ProgressHook() as hook:
        output = pipeline(audio_path, hook=hook)

    return output


def print_diarization(output):
    print("\nRegular speaker diarization:")
    for turn, speaker in output.speaker_diarization:
        print(f"{turn.start:.3f}\t{turn.end:.3f}\t{speaker}")

    print("\nExclusive speaker diarization:")
    if hasattr(output, "exclusive_speaker_diarization"):
        for turn, speaker in output.exclusive_speaker_diarization:
            print(f"{turn.start:.3f}\t{turn.end:.3f}\t{speaker}")
    else:
        print("Not available.")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("audio_path")
    args = parser.parse_args()

    output = run_diarization(args.audio_path)
    print_diarization(output)


if __name__ == "__main__":
    main()

Run:

export HF_TOKEN="<HUGGINGFACE_ACCESS_TOKEN>"
uv run --refresh --script diaritranscribe4.py audio.wav

Lock after it works:

uv lock --script diaritranscribe4.py

Adding faster-whisper back later

After Community-1 works by itself, add faster-whisper back.

# /// script
# requires-python = ">=3.10,<3.14"
# dependencies = [
#   "pyannote.audio>=4,<5",
#   "torch==2.8.0",
#   "torchaudio==2.8.0",
#   "torchcodec==0.7.*",
#   "faster-whisper",
#   "numpy",
#   "scikit-learn",
#   "omegaconf",
# ]
# ///

Then test faster-whisper separately before combining:

from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)

for segment in segments:
    print(segment.start, segment.end, segment.text)

If faster-whisper fails with CUDA/cuDNN/CTranslate2 errors, that is separate from pyannote.


Common Path B failure modes

Failure: No solution found

Usually means you still have a dependency pin like:

brouhaha -> pyannote-audio==3.3.0

Fix:

  • remove Brouhaha from the pyannote 4 environment;
  • fork/update Brouhaha;
  • split environments.

Failure: unexpected keyword argument 'token'

This means you are still on old pyannote.

Check:

from importlib.metadata import version
print(version("pyannote.audio"))

If it prints 3.3.0, you are not on Path B yet.

Failure: unexpected keyword argument 'use_auth_token'

This means you are probably on newer pyannote but still using old code.

Use:

token="<HUGGINGFACE_ACCESS_TOKEN>"

not:

use_auth_token="<HUGGINGFACE_ACCESS_TOKEN>"

Failure: Could not load libtorchcodec

Check:

  • TorchCodec/Torch version compatibility;
  • FFmpeg installation;
  • Python version;
  • platform wheel availability.

Failure: model access denied

Check that you accepted the model conditions and used a valid token.

Failure: CUDA not available

Check PyTorch install:

import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

Use uv’s PyTorch guide for accelerator-specific builds.


Recommended Path B checklist

  1. Create diaritranscribe4.py.
  2. Remove brouhaha from that script.
  3. Use pyannote.audio>=4,<5.
  4. Start with a coherent Torch/TorchAudio/TorchCodec family.
  5. Install FFmpeg.
  6. Accept Community-1 model conditions.
  7. Set HF_TOKEN.
  8. Load with token=....
  9. Use output.speaker_diarization.
  10. Prefer output.exclusive_speaker_diarization for transcript alignment.
  11. Test pyannote alone.
  12. Add faster-whisper back only after pyannote works.
  13. Rebuild speaker assignment around maximum overlap or word-level timestamps.
  14. Lock the migrated script.
  15. Keep the pyannote 3.3 script until the new output is verified.

Bottom line

Path B is not:

change speaker-diarization-3.1 to speaker-diarization-community-1

and it is not:

change use_auth_token= to token=

It is:

remove or isolate the Brouhaha pyannote 3.3 pin
        ↓
move to pyannote.audio 4.x
        ↓
use Community-1
        ↓
install/verify TorchCodec and FFmpeg
        ↓
change the output parsing code
        ↓
use exclusive diarization for transcript alignment
        ↓
lock the new environment

For your project, the safest approach is to keep:

diaritranscribe3.py

as the recovery script and create:

diaritranscribe4.py

as the Community-1 migration script.

Do not merge them until Community-1 works alone, faster-whisper works alone, and the speaker-attributed transcript is at least as good as your pyannote 3.3 path.

This technically solved my problem, as rewriting the script around the starting point you made worked.
For posterity, the changes I made were:

adding a small block of code at the start to bypass a couple of errors:

import torch
from pyannote.audio.models.blocks.pooling import StatsPool  # location of StatsPool in recent pyannote.audio


def patched_forward(self, sequences, weights=None):
    # Avoid std() on a single frame, which yields NaN with correction=1
    mean = sequences.mean(dim=-1)
    if sequences.size(-1) > 1:
        std = sequences.std(dim=-1, correction=1)
    else:
        std = torch.zeros_like(mean)
    return torch.cat([mean, std], dim=-1)


StatsPool.forward = patched_forward
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

a small change to the assign_speaker_to_segment function to account for multiple segments of the same speaker

def assign_speaker_to_segment(segment_start, segment_end, diarization_turns):
    best_speaker = None
    best_overlap = 0.0
    speakerdict = {}
    for speaker in diarization_turns:
        speakerdict[speaker[2]] = 0.0
    for turn_start, turn_end, speaker in diarization_turns:
        speakerdict[speaker] += overlap_seconds(segment_start, segment_end, turn_start, turn_end)
        overlap = speakerdict[speaker]
        if overlap > best_overlap:
            best_overlap = overlap
            best_speaker = speaker

    return best_speaker or "UNKNOWN"

And a small change to the token function.

Unfortunately, this script is just a cleaner version of the previous iteration of my script, and the current iteration was meant to solve a problem regarding diarization errors themselves. For now, thank you, and I will eventually open a topic about the next step once I figure out how to formulate the problem.