---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
model-index:
- name: VideoChat-Flash-Qwen2-7B_res448
  results:
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 74.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 74.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Perception Test
      type: percepTest
    metrics:
    - type: accuracy
      value: 76.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 64.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME (wo sub)
      type: videomme
    metrics:
    - type: accuracy
      value: 65.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LVBench
      type: lvbench
    metrics:
    - type: accuracy
      value: 48.2
      name: accuracy
      verified: true
---
# 🦜VideoChat-Flash-Qwen2-7B_res448⚡
[\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) [\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-Flash) [\[📜 Tech Report\]](https://www.arxiv.org/abs/2501.00574) [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash)

VideoChat-Flash-7B is built on UMT-L (300M) and Qwen2-7B and uses only **16 tokens per frame**. By leveraging YaRN to extend the context window to 128k (Qwen2's native context window is 32k), our model supports input sequences of up to approximately **10,000 frames**.

> Note: Because the training corpus is predominantly English, the model has only basic Chinese comprehension; for optimal performance, we recommend interacting with it in English.
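For a rough sense of what the 16-token-per-frame budget means in practice, here is a back-of-the-envelope sketch (illustrative only; it ignores text/prompt tokens and the optional token compression shown in the usage example below):

```python
# Back-of-the-envelope visual-token budget (illustrative numbers only).
TOKENS_PER_FRAME = 16      # visual tokens per sampled frame, as stated above
CONTEXT_WINDOW = 128_000   # approximate context length after the YaRN extension

def visual_token_count(num_frames: int) -> int:
    """Raw visual-token count for a clip, before any additional compression."""
    return num_frames * TOKENS_PER_FRAME

print(visual_token_count(512))             # 8192 tokens for the 512-frame default used below
print(CONTEXT_WINDOW // TOKENS_PER_FRAME)  # ~8000 frames fit at the full 16 tokens per frame
```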
## 📈 Performance
| Model | MVBench | LongVideoBench | VideoMME (w/o sub) | Max Input Frames |
| --- | --- | --- | --- | --- |
| [VideoChat-Flash-Qwen2_5-2B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) | 70.0 | 58.3 | 57.0 | 10000 |
| [VideoChat-Flash-Qwen2-7B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res224) | 73.2 | 64.2 | 64.0 | 10000 |
| [VideoChat-Flash-Qwen2_5-7B-1M@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224) | 73.4 | **66.5** | 63.5 | 50000 |
| [VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B) | **74.3** | 64.5 | 65.1 | 10000 |
| [VideoChat-Flash-Qwen2-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) | 74.0 | 64.7 | **65.3** | 10000 |
## 🚀 How to use the model

First, install the dependencies listed below; [flash attention2](https://github.com/Dao-AILab/flash-attention) is optional but recommended. We provide a simple installation example:
```
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
```
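flash-attn can be tricky to build; if you are unsure whether it installed correctly, a quick import probe (plain Python, nothing model-specific) tells you whether FlashAttention-2 is available:

```python
# Optional sanity check: is the flash-attn package importable in this environment?
try:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} is available")
except ImportError:
    print("flash-attn is not installed (it is optional; see the note above)")
```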
Then you can use our model:
```python
from transformers import AutoModel, AutoTokenizer
import torch

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2-7B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # whether to use the global token compression inside the LLM
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question1,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config
)
print(output1)

# multi-turn conversation (pass the returned chat_history back in)
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question2,
    chat_history=chat_history,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config
)
print(output2)
```
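If you hit GPU out-of-memory errors on long videos, the frame budget is the simplest knob to turn. The sketch below reuses the same `model.chat` interface with a smaller, purely illustrative `max_num_frames`; the exact value you can afford depends on your GPU:

```python
# Illustrative only: trade temporal coverage for memory by sampling fewer frames.
small_generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=256,
    top_p=0.1,
    num_beams=1
)

output, _ = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt="Give a one-paragraph summary of the video.",
    return_history=True,
    max_num_frames=128,  # fewer sampled frames -> fewer visual tokens -> less GPU memory
    generation_config=small_generation_config
)
print(output)
```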
## ✏️ Citation
```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```