Libraries

How to use hongyuw/bitvla-bitsiglipL-224px-bf16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="hongyuw/bitvla-bitsiglipL-224px-bf16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("hongyuw/bitvla-bitsiglipL-224px-bf16")
model = AutoModelForImageTextToText.from_pretrained("hongyuw/bitvla-bitsiglipL-224px-bf16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use hongyuw/bitvla-bitsiglipL-224px-bf16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "hongyuw/bitvla-bitsiglipL-224px-bf16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hongyuw/bitvla-bitsiglipL-224px-bf16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
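
The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_payload` and `chat` are made up for illustration, and the response parsing assumes the usual OpenAI-compatible `choices[0].message.content` layout.

```python
import json
import urllib.request

MODEL = "hongyuw/bitvla-bitsiglipL-224px-bf16"
IMAGE_URL = (
    "https://cdn.britannica.com/61/93061-050-99147DCE/"
    "Statue-of-Liberty-Island-New-York-Bay.jpg"
)


def build_payload(prompt, image_url, model=MODEL):
    """Build the chat-completions body used in the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]


payload = build_payload("Describe this image in one sentence.", IMAGE_URL)
print(json.dumps(payload, indent=2))
# reply = chat(payload)  # needs a running server, so it is left commented out
```

For a server listening on port 30000, pass `base_url="http://localhost:30000"` to `chat`.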

Use Docker

docker model run hf.co/hongyuw/bitvla-bitsiglipL-224px-bf16

SGLang

How to use hongyuw/bitvla-bitsiglipL-224px-bf16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "hongyuw/bitvla-bitsiglipL-224px-bf16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hongyuw/bitvla-bitsiglipL-224px-bf16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "hongyuw/bitvla-bitsiglipL-224px-bf16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using the same curl request shown above (OpenAI-compatible API on port 30000).

Docker Model Runner
How to use hongyuw/bitvla-bitsiglipL-224px-bf16 with Docker Model Runner:
```
docker model run hf.co/hongyuw/bitvla-bitsiglipL-224px-bf16
```

BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

[paper] [model] [code]

June 2025: BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

Open Source Plan

✅ Paper, Pre-trained VLM and evaluation code.
✅ Fine-tuned VLA code and models
🧭 Pre-training code and VLA.

Checkpoints

Model	Path
BitVLA	hongyuw/bitvla-bitsiglipL-224px-bf16
BitVLA finetuned on LIBERO-Spatial	hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16
BitVLA finetuned on LIBERO-Object	hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16
BitVLA finetuned on LIBERO-Goal	hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16
BitVLA finetuned on LIBERO-Long	hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16
BitVLA w/ BF16 SigLIP	hongyuw/bitvla-siglipL-224px-bf16

Note that we provide the master weights of BitVLA and quantize them online at run time. To realize actual memory savings, quantize the weights offline to 1.58-bit precision. We recommend the bitnet.cpp inference framework for accurately measuring the reduction in inference cost.
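
To make "1.58-bit" concrete, here is a minimal sketch of ternary weight quantization. It assumes the absmean scheme described for BitNet b1.58 (one scale per tensor, values rounded and clipped to {-1, 0, +1}); it is not taken from this repository, and the function names are made up for illustration.

```python
import math


def absmean_ternary_quantize(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} with one shared scale.

    Assumed scheme (absmean): scale = mean(|w|), q = clip(round(w / scale), -1, 1).
    """
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Map ternary codes back to approximate float weights."""
    return [t * scale for t in ternary]


master = [0.9, -0.05, 0.4, -1.2]  # toy "master weights"
ternary, scale = absmean_ternary_quantize(master)
print(ternary)           # [1, 0, 1, -1]
print(round(scale, 4))   # 0.6375
print(dequantize(ternary, scale))

# Three possible states per weight carry log2(3) bits of information.
print(round(math.log2(3), 2))  # 1.58
```

Online quantization repeats this step on the full-precision master weights at run time, whereas offline quantization stores only the ternary codes and the scale.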

Due to limited resources, we have not yet pre-trained BitVLA on a large-scale robotics dataset. We are actively working to secure additional compute resources to conduct this pre-training.

Vision-Language

Evaluation on VQA

We use the LMM-Eval toolkit to evaluate on VQA tasks. We provide a modified copy of the transformers repo in which modeling_llava.py and modeling_siglip.py are changed to support W1.58-A8 quantization.
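
The "A8" half of W1.58-A8 refers to 8-bit activations. As a rough illustration, the sketch below quantizes one token's activation vector with a per-token absmax scale, which is the scheme described for BitNet; whether the modified modeling files use exactly this form is an assumption, and the function name is hypothetical.

```python
def absmax_int8_quantize(activations, eps=1e-5):
    """Quantize one token's activations to signed 8-bit integers.

    Assumed scheme (absmax): the largest magnitude in the vector maps to 127.
    """
    scale = 127.0 / max(max(abs(a) for a in activations), eps)
    quantized = [max(-128, min(127, round(a * scale))) for a in activations]
    return quantized, scale


token = [0.5, -2.0, 0.25, 0.0]  # toy activation vector for a single token
quantized, scale = absmax_int8_quantize(token)
print(quantized)  # [32, -127, 16, 0]
print(scale)      # 63.5

# Dividing by the scale recovers an approximation of the original values.
print([q / scale for q in quantized])
```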

Run the evaluation inside the nvidia_24_07 Docker container (image nvcr.io/nvidia/pytorch:24.07-py3). Start the container and install the packages:

docker run --name nvidia_24_07  --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only use for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only use for multimodal evaluation

First, download the BitVLA model from HuggingFace:

git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L
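
If git-lfs is not available, the checkpoints can probably also be fetched with the `huggingface_hub` library. This is a sketch, not part of the repository's instructions: `local_dir_for` and `download_all` are hypothetical helpers, and the target directory is a placeholder.

```python
from pathlib import Path

# Repo ids from the two clone commands above.
REPOS = [
    "hongyuw/bitvla-bitsiglipL-224px-bf16",  # BitVLA w/ W1.58-A8 SigLIP-L
    "hongyuw/bitvla-siglipL-224px-bf16",     # BitVLA w/ BF16 SigLIP-L
]


def local_dir_for(repo_id, root):
    """Mirror `git clone` naming: the folder is the repo name without the owner."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root):
    """Download every repo in REPOS into its own folder under `root`."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id in REPOS:
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir_for(repo_id, root)))


print(local_dir_for(REPOS[0], "/YOUR_PATH_TO_EXP"))
# download_all("/YOUR_PATH_TO_EXP")  # large download, so it is left commented out
```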

Then run the following scripts to conduct evaluations:

cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16

Vision-Language-Action

OFT Training

1. Preparing OFT

We fine-tune BitVLA with the OFT training recipe from OpenVLA-OFT. First, set up the environment as required by that project. Refer to SETUP.md and LIBERO.md for detailed instructions.

conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124

# or use the provided docker
# docker run --name nvidia_24_07  --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity

cd BitVLA
pip install -e openvla-oft/
pip install -e transformers

cd openvla-oft/

# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
# in BitVLA
pip install -r experiments/robot/libero/libero_requirements.txt

# install bitvla
pip install -e bitvla/

We use the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download the dataset from HuggingFace.

git clone git@hf.co:datasets/openvla/modified_libero_rlds
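
The `git@hf.co:` form requires an SSH key registered with your Hugging Face account. Without one, the equivalent HTTPS URL should work with `git clone` as well (git-lfs is typically needed for the large files). The small helper below, a hypothetical name used only for illustration, shows how the two URL forms correspond.

```python
def ssh_to_https(ssh_url):
    """Convert a `git@hf.co:<path>` remote into its HTTPS equivalent."""
    prefix = "git@hf.co:"
    if not ssh_url.startswith(prefix):
        raise ValueError(f"not a Hugging Face SSH remote: {ssh_url}")
    return "https://huggingface.co/" + ssh_url[len(prefix):]


print(ssh_to_https("git@hf.co:datasets/openvla/modified_libero_rlds"))
# https://huggingface.co/datasets/openvla/modified_libero_rlds
```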

2. OFT fine-tuning

First, convert the BitVLA checkpoint to a format compatible with the VLA codebase.

python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16

After that, you can fine-tune BitVLA with the following command. Here we take LIBERO-Spatial as an example:

torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_bitnet.py \
  --vla_path /path/to/bitvla-bitsiglipL-224px-bf16 \
  --data_root_dir /path/to/modified_libero_rlds/ \
  --dataset_name libero_spatial_no_noops \
  --run_root_dir /path/to/save/your/ckpt \
  --use_l1_regression True \
  --warmup_steps 375 \
  --use_lora False \
  --num_images_in_input 2 \
  --use_proprio True \
  --batch_size 2 \
  --grad_accumulation_steps 8 \
  --learning_rate 1e-4 \
  --max_steps 10001 \
  --save_freq 10000 \
  --save_latest_checkpoint_only False \
  --image_aug True \
  --run_id_note your_id
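
When running on a different number of GPUs, it helps to keep the effective batch size fixed. The sketch below assumes that `--batch_size` is a per-device value, as in OpenVLA-OFT, so the command above trains with 4 x 2 x 8 = 64 samples per optimizer step; the helper names are made up for illustration.

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Samples consumed per optimizer step (assumes a per-device batch size)."""
    return num_gpus * per_device_batch * grad_accum_steps


def grad_accum_for(target_batch, num_gpus, per_device_batch):
    """Gradient accumulation steps needed to reach `target_batch` exactly."""
    per_step = num_gpus * per_device_batch
    if target_batch % per_step != 0:
        raise ValueError(
            f"target batch {target_batch} is not a multiple of {per_step}"
        )
    return target_batch // per_step


# Settings from the command above: 4 processes, batch_size 2, 8 accumulation steps.
print(effective_batch_size(4, 2, 8))  # 64

# Keeping 64 samples per step on 8 GPUs with the same per-device batch size.
print(grad_accum_for(64, 8, 2))  # 4
```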

Evaluation on LIBERO

You can download our fine-tuned BitVLA models from HuggingFace. For example, to evaluate on the LIBERO-Spatial suite, run the following script:

python experiments/robot/libero/run_libero_eval_bitnet.py \
    --pretrained_checkpoint  /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
    --task_suite_name libero_spatial \
    --info_in_path "information you want to show in path" \
    --model_family "bitnet"
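
To script the evaluation, the command can be assembled as an argument list. This sketch reuses only the flags shown above; `libero_eval_command` is a hypothetical helper, and the checkpoint path is the same placeholder as in the command.

```python
import shlex


def libero_eval_command(checkpoint, suite, note="eval"):
    """Build the argv list for run_libero_eval_bitnet.py (hypothetical helper)."""
    return [
        "python",
        "experiments/robot/libero/run_libero_eval_bitnet.py",
        "--pretrained_checkpoint", checkpoint,
        "--task_suite_name", suite,
        "--info_in_path", note,
        "--model_family", "bitnet",
    ]


cmd = libero_eval_command(
    "/path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16",
    "libero_spatial",
)
print(shlex.join(cmd))
# To launch it, call subprocess.run(cmd, check=True) from the directory
# that contains experiments/.
```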

Acknowledgement

This repository builds on LMM-Eval, Hugging Face's transformers, and OpenVLA-OFT.

Citation

If you find this repository useful, please consider citing our work:

@article{bitvla,
  title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation}, 
  author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
  year={2025},
  eprint={2506.07530},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
}

License

This project is licensed under the MIT License.

Contact Information

For help or issues with using the models, please submit a GitHub issue.

Model tree for hongyuw/bitvla-bitsiglipL-224px-bf16

Base model: microsoft/bitnet-b1.58-2B-4T
Finetunes of this model: 4 models


Paper

BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation (arXiv 2506.07530, published Jun 9, 2025)
