Broken audio processing

#42
by souflaeeh - opened

Audio processing seems to break for nearly all use-cases that don't exclusively involve transcription, summarization, or translation. For example, "Transcribe this audio" prompts work well, but with other prompts the model flags the audio input as unusual. Example code, adapted from the official audio processing guide:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

GEMMA_MODEL_ID = "google/gemma-3n-E4B-it"

processor = AutoProcessor.from_pretrained(GEMMA_MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
            GEMMA_MODEL_ID, torch_dtype="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://ai.google.dev/gemma/docs/audio/roses-are.wav"},
            {"type": "text", "text": "1. Transcribe the audio\n2. Summarize the audio\n3. Did you notice anything unusual about the audio?"},
        ]
    }
]

input_ids = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True, return_dict=True,
        return_tensors="pt",
)
input_ids = input_ids.to(model.device, dtype=model.dtype)

outputs = model.generate(**input_ids, max_new_tokens=400)

text = processor.batch_decode(
    outputs,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False
)
print(text[0])
```

Output (user prompt omitted):

```
<start_of_turn>model
**1. Transcription of the audio:**

The audio consists of the phrase "Roses are red, violets are blue." repeated multiple times.

**2. Summary of the audio:**

The audio simply repeats the well-known rhyming couplet "Roses are red, violets are blue." over and over again. There is no variation in tone or pacing.

**3. Did you notice anything unusual about the audio?**

Yes, the most unusual thing about the audio is the **extreme repetition**. The phrase is played repeatedly, filling the entire duration of the audio. This is not a typical way to hear a common phrase, making it stand out.<end_of_turn>
```
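Since the model insists the clip repeats, one cheap sanity check is to rule out that the decoded waveform really is tiled somewhere in the preprocessing path. A minimal offline sketch of such a check (using synthetic audio in place of the real file; `looks_duplicated` is my own helper, not a Transformers API):

```python
import numpy as np

def looks_duplicated(audio: np.ndarray) -> bool:
    """Return True if the second half of the clip is (nearly) an exact
    copy of the first half -- a cheap check for accidental tiling."""
    half = len(audio) // 2
    first, second = audio[:half], audio[half:2 * half]
    denom = np.linalg.norm(first) * np.linalg.norm(second)
    if denom == 0:
        return False
    # Normalized correlation close to 1.0 means the halves are identical.
    return float(np.dot(first, second) / denom) > 0.99

rng = np.random.default_rng(0)
clip = rng.standard_normal(16_000)  # 1 s of noise at 16 kHz
tiled = np.tile(clip, 2)            # simulated "repeated" clip

print(looks_duplicated(clip))   # -> False (halves are independent)
print(looks_duplicated(tiled))  # -> True  (halves are identical)
```

Running the same check on the waveform that actually reaches the feature extractor would tell us whether the repetition is in the data or in the modeling code.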

This behavior isn't a hallucination that only shows up when the model is explicitly asked to find issues with the audio. It also occurs when "voice-chatting" with the model (sending user messages as audio): the model has trouble understanding the input and points out repeated phrases or letters, although it often still manages to respond correctly to the message while flagging the weirdness. Because of this, I suspect the model itself is trained for and capable of handling such queries, and that the problem lies in the Transformers implementation (masking, perhaps?).
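One quick way to probe the masking hypothesis is to dump every tensor the processor returns and check whether the audio-feature mask actually covers the full clip. A minimal sketch, run here on a toy batch since the real output requires the checkpoint (the key names in the toy batch are assumptions, not necessarily what the Gemma processor emits):

```python
import numpy as np

def summarize_batch(batch):
    """Describe each tensor in a processor output; for anything that
    looks like a mask, also report how many positions are valid."""
    lines = []
    for key, value in batch.items():
        line = f"{key}: shape={tuple(value.shape)}"
        if "mask" in key:
            line += f" valid={int(value.sum())}/{value.size}"
        lines.append(line)
    return lines

# Toy stand-in for a processor output (hypothetical key names):
batch = {
    "input_ids": np.zeros((1, 32), dtype=np.int64),
    "attention_mask": np.ones((1, 32), dtype=np.int64),
    "input_features": np.zeros((1, 128, 80), dtype=np.float32),
    "input_features_mask": np.ones((1, 128), dtype=bool),
}
for line in summarize_batch(batch):
    print(line)
```

If the real `valid` count for the audio-feature mask is much smaller than the number of frames the clip should produce, that would support the masking theory.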
