---
license: apache-2.0
datasets:
- lighthouse-emnlp2024/Clotho-Moment
language:
- en
---

# Audio Moment-DETR

This is Audio Moment-DETR (AM-DETR), the model proposed in [Language-based Audio Moment Retrieval](https://arxiv.org/abs/2409.15672).
Given a text query, AM-DETR retrieves the segments of a long audio recording that are relevant to the query.

## Install

[Lighthouse](https://github.com/line/lighthouse) must be installed first.
Check the dependencies below against your environment.

```bash
apt install ffmpeg
```

```bash
pip install 'git+https://github.com/line/lighthouse.git'
```

```bash
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 torchtext==0.16.0 transformers==4.51.3 --index-url https://download.pytorch.org/whl/cu118
```

## Sample script

```python
import io

import requests
import torch
from transformers import AutoModel, AutoConfig

# Load the model configuration and weights from the Hugging Face Hub.
repo_id = "lighthouse-emnlp2024/AM-DETR"
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
config.device = "cpu"
model = AutoModel.from_pretrained(repo_id, config=config, trust_remote_code=True)

# Download the example recording into an in-memory buffer.
audio_bytes = io.BytesIO(requests.get('https://github.com/line/lighthouse/raw/refs/heads/main/api_example/1a-ODBWMUAE.wav').content)

query = "Heavy rain falls"

# Extract audio features, then retrieve the moments matching the query.
feats = model.encode_audio(audio_path=audio_bytes)
prediction = model.predict(query, feats)

for start, end, score in prediction["pred_relevant_windows"]:
    print(f"Moment, Score: {start:05.2f} - {end:05.2f}, {score:.2f}")
```

## Citation

```bibtex
@inproceedings{munakata2025language,
  title={Language-based Audio Moment Retrieval},
  author={Munakata, Hokuto and Nishimura, Taichi and Nakada, Shota and Komatsu, Tatsuya},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}
```
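
## Post-processing example

The sample script prints every window in `prediction["pred_relevant_windows"]`. In practice it can be useful to keep only the most confident moments. The following is a minimal sketch of one way to do that, assuming each window is a `(start, end, score)` triple as unpacked in the sample script. The helper `top_moments`, its `min_score` and `k` parameters, and the example values are hypothetical and are not part of Lighthouse or the model's API.

```python
from typing import Dict, List, Tuple

Moment = Tuple[float, float, float]  # (start, end, score)


def top_moments(prediction: Dict[str, list], min_score: float = 0.5, k: int = 3) -> List[Moment]:
    """Return at most k windows whose score is >= min_score, best first.

    Hypothetical helper: assumes prediction["pred_relevant_windows"] is an
    iterable of (start, end, score) triples, as in the sample script.
    """
    windows = [(float(s), float(e), float(c)) for s, e, c in prediction["pred_relevant_windows"]]
    kept = [w for w in windows if w[2] >= min_score]
    kept.sort(key=lambda w: w[2], reverse=True)  # highest score first
    return kept[:k]


# Made-up values that only mimic the assumed output structure.
example = {
    "pred_relevant_windows": [
        [0.0, 12.5, 0.91],
        [30.0, 41.0, 0.42],
        [55.0, 60.0, 0.67],
    ]
}

for start, end, score in top_moments(example):
    print(f"Moment, Score: {start:05.2f} - {end:05.2f}, {score:.2f}")
```

With a real model, `example` would be replaced by the `prediction` returned from `model.predict(query, feats)`. The threshold of 0.5 is an arbitrary starting point and likely needs tuning per use case.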