xywang626 committed
Commit 7582df2 · verified · 1 Parent(s): 3c2aeee

Update README.md

Files changed (1):
  1. README.md +136 -154
README.md CHANGED
@@ -29,7 +29,7 @@ library_name: transformers
29
  line-height:1.25;
30
  text-align:center;
31
  margin:0 0 24px;">
32
- OpenCUA: Open Foundations for Computer-Use Agents
33
  </h1>
34
 
35
  <div style="
@@ -38,7 +38,7 @@ library_name: transformers
38
  gap:12px;
39
  flex-wrap:wrap;
40
  margin-bottom:28px;">
41
-
42
  <a href="https://opencua.xlang.ai/" style="
43
  display:inline-block;
44
  padding:8px 24px;
@@ -78,6 +78,24 @@ library_name: transformers
78
 
79
  <div style="max-width:900px;margin:0 auto;">
80
 
81
  # Introduction
82
  <div style="
83
  max-width: 880px; /* 可按需调节整体宽度 */
@@ -85,12 +103,15 @@ library_name: transformers
85
  text-align: justify; /* 关键:两端对齐 */
86
  text-justify: inter-word; /* 优化英文对齐效果 */
87
  line-height: 1.6;">
88
-
89
- OpenCUA models (OpenCUA-7B and OpenCUA-32B) are end-to-end computer-use foundation models that can produce executable actions in computer environments. They are based on the weights of Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct.
90
- They demonstrate superior performance across CUA benchmarks. In particular, <b>OpenCUA-32B</b> achieves an average success rate of **34.8%** on [OSWorld-Verified](https://os-world.github.io/),
91
- establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Both models also have strong grounding performance: OpenCUA-32B achieves 59.6% on [OSWorld-G](https://osworld-grounding.github.io/) and 55.3% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
92
  </div>
93
 
 
 
 
94
  ### Key Features
95
 
96
- **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
@@ -103,9 +124,8 @@ establishing a new state-of-the-art (SOTA) among open-source models and surpassi
103
  # Performance
104
 
105
  ### Online Agent Evaluation
106
- OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
107
- OpenCUA-32B achieves the best performance among all open-source models with an average success rate of 34.8%, outperforming prior baselines by large margins.
108
- It also closes the gap to proprietary Claude models.
109
  <div align="center">
110
 
111
  | **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
@@ -116,13 +136,14 @@ It also closes the gap to proprietary Claude models.
116
  | Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
117
  | Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
118
  | **Open-Source** | | | |
119
- | Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
120
- | Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
121
  | Kimi-VL-A3B | 9.7 | — | 10.3 |
122
  | UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
123
  | UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
124
  | OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
125
- | **OpenCUA-32B *(Ours)*** | **29.7** | **34.1** | **34.8** |
 
126
  </div>
127
 
128
  *OpenCUA scores are the mean of 3 independent runs.*
@@ -130,15 +151,14 @@ It also closes the gap to proprietary Claude models.
130
  ### GUI Grounding Performance
131
  <div align="center">
132
 
133
- | **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
134
- |-------|-----------|---------------|----------------|
135
- | Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
136
- | Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
137
- | UI-TARS-72B | 57.1 | 90.3 | 38.1 |
138
- | **OpenCUA-A3B** | 48.6 | 91.4 | 28.5 |
139
- | **OpenCUA-Qwen2-7B** | 45.7 | 88.5 | 23.7 |
140
- | **OpenCUA-7B** | 55.3 | 92.3 | 50.0 |
141
- | **OpenCUA-32B** | **59.6** | **93.4** | **55.3** |
142
  </div>
143
 
144
 
@@ -157,145 +177,124 @@ It also closes the gap to proprietary Claude models.
157
 
158
  # 🚀 Quick Start
159
  <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
160
- <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B):</strong>
161
-
162
  To align with our training infrastructure, we have modified the model in two places:
163
  <ul style="margin-top: 8px;">
164
<li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
165
<li>2. The model uses the same tokenizer and chat template as Kimi-VL.</li>
166
- <li>Do not use the default transformers and vllm classes to load the model. Tokenizer and Chat Template should be aligned if training the models.</li>
167
  </ul>
168
  </div>
169
 
170
 
171
  ## Installation & Download
172
 
173
- First, install the required transformers dependencies:
174
 
175
  ```bash
176
- conda create -n opencua python=3.10
177
  conda activate opencua
178
- pip install -r requirement.txt
179
  ```
180
 
181
- Download the model weight from huggingface:
182
- ```bash
183
  from huggingface_hub import snapshot_download
184
  snapshot_download(
185
  repo_id="xlangai/OpenCUA-32B",
186
- local_dir="OpenCUA-32B",
187
- local_dir_use_symlinks=False
188
  )
189
  ```
190
 
191
- ## 🎯 GUI Grounding
192
 
193
- The following code demonstrates how to use OpenCUA models for GUI grounding tasks:
194
 
195
  ```python
196
  import base64
197
- import torch
198
- from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
199
- from PIL import Image
200
- import json
 
201
 
202
  def encode_image(image_path: str) -> str:
203
- """Encode image to base64 string for model input."""
204
  with open(image_path, "rb") as f:
205
  return base64.b64encode(f.read()).decode()
206
 
207
- def load_opencua_model(model_path: str):
208
- """Load OpenCUA model, tokenizer, and image processor."""
209
- tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
210
- model = AutoModel.from_pretrained(
211
- model_path,
212
- torch_dtype="auto",
213
- device_map="auto",
214
- trust_remote_code=True
215
- )
216
- image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
217
-
218
- return model, tokenizer, image_processor
219
 
220
- def create_grounding_messages(image_path: str, instruction: str):
221
- """Create chat messages for GUI grounding task."""
222
  system_prompt = (
223
  "You are a GUI agent. You are given a task and a screenshot of the screen. "
224
  "You need to perform a series of pyautogui actions to complete the task."
225
  )
226
-
227
  messages = [
228
  {"role": "system", "content": system_prompt},
229
  {
230
  "role": "user",
231
  "content": [
232
- {"type": "image", "image": f"data:image/png;base64,{encode_image(image_path)}"},
 
 
 
233
  {"type": "text", "text": instruction},
234
  ],
235
  },
236
  ]
237
- return messages
238
 
239
- def run_inference(model, tokenizer, image_processor, messages, image_path):
240
- """Run inference on the model."""
241
- # Prepare text input
242
- input_ids = tokenizer.apply_chat_template(
243
- messages, tokenize=True, add_generation_prompt=True
244
- )
245
- input_ids = torch.tensor([input_ids]).to(model.device)
246
-
247
- # Prepare image input
248
- image = Image.open(image_path).convert('RGB')
249
- image_info = image_processor.preprocess(images=[image])
250
- pixel_values = torch.tensor(image_info['pixel_values']).to(
251
- dtype=torch.bfloat16, device=model.device
252
  )
253
- grid_thws = torch.tensor(image_info['image_grid_thw'])
254
-
255
- # Generate response
256
- with torch.no_grad():
257
- generated_ids = model.generate(
258
- input_ids,
259
- pixel_values=pixel_values,
260
- grid_thws=grid_thws,
261
- max_new_tokens=512,
262
- temperature=0
263
- )
264
-
265
- # Decode output
266
- prompt_len = input_ids.shape[1]
267
- generated_ids = generated_ids[:, prompt_len:]
268
- output_text = tokenizer.batch_decode(
269
- generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
270
- )[0]
271
-
272
- return output_text
273
 
274
  # Example usage
275
- model_path = "xlangai/OpenCUA-32B" # or other model variants
276
  image_path = "screenshot.png"
277
  instruction = "Click on the submit button"
278
 
279
- # Load model
280
- model, tokenizer, image_processor = load_opencua_model(model_path)
281
-
282
- # Create messages and run inference
283
- messages = create_grounding_messages(image_path, instruction)
284
- result = run_inference(model, tokenizer, image_processor, messages, image_path)
285
-
286
  print("Model output:", result)
287
  ```
288
 
289
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
290
- <em>Expected result: ```python
291
- pyautogui.click(x=1432, y=344)
292
- ```</em>
293
  </div>
294
 
295
  ## 🖥️ Computer Use Agent
296
**[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as inner monologue, and predicts the next action to be executed. OpenCUAAgent uses 3 images in total and the L2 CoT format by default.
297
 
298
- Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
299
  ```
300
  python run_multienv_opencua.py \
301
  --headless \
@@ -306,74 +305,57 @@ Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
306
  --num_envs 30 \
307
  --coordinate_type qwen25
308
  ```
309
- <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
310
- <em>Currently we only supports huggingface inference. We are implementing the vLLM supports of OpenCUA models. Please stay tuned.</em>
311
- </div>
312
 
313
  ## Important Notes on Coordinate Systems
314
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
315
  <ul style="margin: 0;">
316
- <li><strong><code>xlangai/OpenCUA-A3B</code></strong> – Relative coordinates <em>(not supported in this code)</em></li>
317
- <li><strong><code>xlangai/OpenCUA-Qwen2-7B</code></strong> – Relative coordinates</li>
318
- <li><strong><code>xlangai/OpenCUA-7B</code></strong> – Absolute coordinates</li>
319
- <li><strong><code>xlangai/OpenCUA-32B</code></strong> – Absolute coordinates</li>
320
  </ul>
321
  </div>
322
 
323
- **OpenCUA models use different coordinate systems depending on the base model:**
324
-
325
- - **OpenCUA-Qwen2-7B**: Outputs **relative coordinates** (0.0 to 1.0 range)
326
- ```python
327
- # Example output: pyautogui.click(x=0.5, y=0.3)
328
- # x=0.5 means 50% from left edge, y=0.3 means 30% from top edge
329
-
330
- # Convert to absolute coordinates:
331
- def qwen2_relative_to_absolute(rel_x, rel_y, original_width, original_height):
332
- abs_x = int(rel_x * original_width)
333
- abs_y = int(rel_y * original_height)
334
- return abs_x, abs_y
335
- ```
336
-
337
- - **OpenCUA-7B and OpenCUA-32B** (Qwen2.5-based): Output **absolute coordinates** after smart resize
338
- ```python
339
- # Example output: pyautogui.click(x=960, y=324)
340
- # These are coordinates on the smart-resized image, not the original image
341
-
342
- # Convert to original image coordinates:
343
- # Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
344
- def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
345
- # First, calculate the smart-resized dimensions
346
- resized_height, resized_width = smart_resize(original_height, original_width, factor = 28, min_pixels = 3136, max_pixels = 12845056)
347
-
348
- # Convert model output to relative coordinates on original image
349
- rel_x = model_x / resized_width
350
- rel_y = model_y / resized_height
351
-
352
- # Then convert to absolute coordinates on original image
353
- abs_x = int(rel_x * original_width)
354
- abs_y = int(rel_y * original_height)
355
- return abs_x, abs_y
356
- ```
357
 
358
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
359
  <strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
360
  <p style="margin: 8px 0 0;">
361
- The Qwen2.5-VL models use a smart resize preprocessing that maintains aspect ratio while fitting within pixel constraints.
362
  For coordinate conversion, you need the smart resize function from the
363
  <a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
364
  official Qwen2.5-VL implementation</a>.
365
  </p>
366
  </div>
367
 
368
-
369
- # TODO
370
- ## vLLM Support
371
- We are actively working with the vLLM team to add support for OpenCUA models.
372
-
373
- **Workaround:** For now, please use the standard transformers library as shown in the examples above. We will update this section once vLLM support becomes available.
374
-
375
- ## Training Code
376
- OpenCUA models are developed based on the training infrastructure of Kimi Team. We are developting the training pipeline based on the open-source infrastructure as well.
377
 
378
  <div style="text-align:center;">
379
 
@@ -404,14 +386,14 @@ If you use OpenCUA models in your research, please cite our work:
404
 
405
  ```bibtex
406
  @misc{wang2025opencuaopenfoundationscomputeruse,
407
- title={OpenCUA: Open Foundations for Computer-Use Agents},
408
  author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
409
  year={2025},
410
  eprint={2508.09123},
411
  archivePrefix={arXiv},
412
  primaryClass={cs.AI},
413
- url={https://arxiv.org/abs/2508.09123},
414
  }
415
  ```
416
 
417
- </div>
 
29
  line-height:1.25;
30
  text-align:center;
31
  margin:0 0 24px;">
32
+ OpenCUA-32B
33
  </h1>
34
 
35
  <div style="
 
38
  gap:12px;
39
  flex-wrap:wrap;
40
  margin-bottom:28px;">
41
+
42
  <a href="https://opencua.xlang.ai/" style="
43
  display:inline-block;
44
  padding:8px 24px;
 
78
 
79
  <div style="max-width:900px;margin:0 auto;">
80
 
81
+ # 🚀 vLLM Serve (Recommended)
82
+
83
+ We recommend using vLLM for production deployment. It requires **vllm>=0.12.0** and the `--trust-remote-code` flag.
84
+
85
+ ```bash
86
+ # OpenCUA-32B (4 GPUs, tensor parallel)
87
+ vllm serve xlangai/OpenCUA-32B \
88
+ --trust-remote-code \
89
+ --tensor-parallel-size 4 \
90
+ --served-model-name opencua-32b \
91
+ --host 0.0.0.0 \
92
+ --port 8000
93
+ ```
94
+
95
+ Adjust `--tensor-parallel-size` and `--gpu-memory-utilization` based on your hardware configuration.
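Once the server is running, you can sanity-check the OpenAI-compatible endpoint before wiring up an agent. A minimal sketch, assuming the host, port, and `--served-model-name` from the command above:

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The served model name ("opencua-32b") should appear in this list if the server is healthy.
print([m.id for m in client.models.list().data])
```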
96
+
97
+ ---
98
+
99
  # Introduction
100
  <div style="
101
  max-width: 880px; /* 可按需调节整体宽度 */
 
103
  text-align: justify; /* 关键:两端对齐 */
104
  text-justify: inter-word; /* 优化英文对齐效果 */
105
  line-height: 1.6;">
106
+
107
+ OpenCUA models (OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B) are end-to-end computer-use foundation models that can produce executable actions in computer environments, with strong planning and grounding capabilities. They are based on the Qwen2.5-VL model family.
108
+
109
+ With the help of the OpenCUA framework, our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, <b>OpenCUA-72B</b> achieves an average success rate of **45.0%** on [OSWorld-Verified](https://os-world.github.io/), establishing a new state-of-the-art (SOTA) among open-source models. OpenCUA-72B also has strong grounding ability, achieving 37.3% (SOTA) on [UI-Vision](https://arxiv.org/abs/2504.07981) and 60.8% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
110
  </div>
111
 
112
+ ## 📢 Updates
113
+ - 2026-01-17: 🎉 **vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B!** Thanks to the [Meituan EvoCUA Team](https://github.com/meituan) for their contributions to vLLM integration.
114
+
115
  ### Key Features
116
 
117
- **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
 
124
  # Performance
125
 
126
  ### Online Agent Evaluation
127
+ OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
128
+ OpenCUA-72B achieves the best performance among all open-source models with an average success rate of 45.0%, establishing a new state-of-the-art (SOTA).
 
129
  <div align="center">
130
 
131
  | **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
 
136
  | Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
137
  | Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
138
  | **Open-Source** | | | |
139
+ | Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
140
+ | Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
141
  | Kimi-VL-A3B | 9.7 | — | 10.3 |
142
  | UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
143
  | UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
144
  | OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
145
+ | OpenCUA-32B *(Ours)* | 29.7 | 34.1 | 34.8 |
146
+ | **OpenCUA-72B *(Ours)*** | **39.0** | **44.9** | **45.0** |
147
  </div>
148
 
149
  *OpenCUA scores are the mean of 3 independent runs.*
 
151
  ### GUI Grounding Performance
152
  <div align="center">
153
 
154
+ | **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** | **UI-Vision** |
155
+ |-------|-----------|---------------|----------------|----------|
156
+ | Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 | 0.85 |
157
+ | Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | - |
158
+ | UI-TARS-72B | 57.1 | 90.3 | 38.1 | 25.5 |
159
+ | **OpenCUA-7B** | 55.3 | 92.3 | 50.0 | 29.7 |
160
+ | **OpenCUA-32B** | 59.6 | 93.4 | 55.3 | 33.3 |
161
+ | **OpenCUA-72B** | **59.2** | **92.9** | **60.8** | **37.3** |
 
162
  </div>
163
 
164
 
 
177
 
178
  # 🚀 Quick Start
179
  <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
180
+ <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):</strong>
181
+
182
  To align with our training infrastructure, we have modified the model in two places:
183
  <ul style="margin-top: 8px;">
184
<li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
185
<li>2. The model uses the same tokenizer and chat template as Kimi-VL.</li>
186
+ <li>vLLM is supported via the <code>--trust-remote-code</code> flag. The tokenizer and chat template should be kept aligned if you train the models.</li>
187
  </ul>
188
  </div>
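Because of the tokenizer and chat-template change noted above, load the tokenizer from the OpenCUA checkpoint itself (with `trust_remote_code=True`) rather than from a stock Qwen2.5-VL repo if you need to inspect or tokenize prompts offline. A minimal sketch:

```python
from transformers import AutoTokenizer

# The OpenCUA checkpoints ship a Kimi-VL-style tokenizer and chat template,
# so load them from the OpenCUA repo with trust_remote_code enabled.
tokenizer = AutoTokenizer.from_pretrained("xlangai/OpenCUA-32B", trust_remote_code=True)

messages = [{"role": "user", "content": "Hello"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```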
189
 
190
 
191
  ## Installation & Download
192
 
193
+ First, install the required dependencies:
194
 
195
  ```bash
196
+ conda create -n opencua python=3.12
197
  conda activate opencua
198
+ pip install "openai>=1.0.0"
199
  ```
200
 
201
+ Download the model weights from Hugging Face (optional; vLLM can also download them automatically):
202
+ ```python
203
  from huggingface_hub import snapshot_download
204
  snapshot_download(
205
  repo_id="xlangai/OpenCUA-32B",
206
+ local_dir="OpenCUA-32B",
207
+ local_dir_use_symlinks=False
208
  )
209
  ```
210
 
211
+ ## 🎯 GUI Grounding
212
 
213
+ First, start the vLLM server:
214
+
215
+ ```bash
216
+ vllm serve xlangai/OpenCUA-32B \
217
+ --trust-remote-code \
218
+ --tensor-parallel-size 4 \
219
+ --served-model-name opencua-32b \
220
+ --host 0.0.0.0 \
221
+ --port 8000
222
+ ```
223
+
224
+ Then run the following code to test GUI grounding:
225
 
226
  ```python
227
  import base64
228
+ from openai import OpenAI
229
+
230
+ # vLLM server configuration
231
+ VLLM_BASE_URL = "http://localhost:8000/v1"
232
+ MODEL_NAME = "opencua-32b" # Should match --served-model-name in vllm serve
233
 
234
  def encode_image(image_path: str) -> str:
235
+ """Encode image to base64 string."""
236
  with open(image_path, "rb") as f:
237
  return base64.b64encode(f.read()).decode()
238
 
239
+ def run_grounding(image_path: str, instruction: str) -> str:
240
+ """Run GUI grounding inference via vLLM."""
241
+ client = OpenAI(base_url=VLLM_BASE_URL, api_key="EMPTY")
 
242
 
 
 
243
  system_prompt = (
244
  "You are a GUI agent. You are given a task and a screenshot of the screen. "
245
  "You need to perform a series of pyautogui actions to complete the task."
246
  )
247
+
248
  messages = [
249
  {"role": "system", "content": system_prompt},
250
  {
251
  "role": "user",
252
  "content": [
253
+ {
254
+ "type": "image_url",
255
+ "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
256
+ },
257
  {"type": "text", "text": instruction},
258
  ],
259
  },
260
  ]
 
261
 
262
+ response = client.chat.completions.create(
263
+ model=MODEL_NAME,
264
+ messages=messages,
265
+ max_tokens=512,
266
+ temperature=0,
267
  )
268
+
269
+ return response.choices[0].message.content
270
 
271
  # Example usage
 
272
  image_path = "screenshot.png"
273
  instruction = "Click on the submit button"
274
 
275
+ result = run_grounding(image_path, instruction)
276
  print("Model output:", result)
277
  ```
278
 
279
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
280
+ <em>Expected result:</em> <code>pyautogui.click(x=1432, y=344)</code>
 
 
281
  </div>
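The model returns the action as plain text. If you want to execute it or convert its coordinates (see the coordinate-system notes below), one option is to extract the x/y values with a small parser. A sketch, assuming the single-click output format shown above:

```python
import re

def parse_click(output: str):
    """Extract (x, y) from a pyautogui.click(...) string like the expected result above."""
    match = re.search(r"pyautogui\.click\(x=(\d+),\s*y=(\d+)\)", output)
    if match is None:
        return None  # the model produced a different action or format
    return int(match.group(1)), int(match.group(2))

print(parse_click("pyautogui.click(x=1432, y=344)"))  # -> (1432, 344)
```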
282
 
283
+ You can also run the grounding examples in [OpenCUA/model/inference/](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/):
284
+ ```bash
285
+ cd ./model/inference/
286
+
287
+ # vLLM (requires running vLLM server first)
288
+ python vllm_inference.py
289
+
290
+ # HuggingFace Transformers
291
+ python huggingface_inference.py
292
+ ```
293
+
294
  ## 🖥️ Computer Use Agent
295
**[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as inner monologue, and predicts the next action to be executed. OpenCUAAgent uses 3 images in total and the L2 CoT format by default.
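Conceptually, one step of that perceive-reason-act loop looks roughly like the snippet below. This is only an illustrative sketch against the vLLM endpoint from the Quick Start, not the actual OpenCUAAgent implementation; the real agent uses the L2 CoT prompt format and executes the predicted action inside the OSWorld VM:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def predict_next_action(task: str, screenshot_paths: list[str]) -> str:
    """Ask the model for the next pyautogui action given the task and recent screenshots."""
    content = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode_image(p)}"}}
        for p in screenshot_paths[-3:]  # keep only the 3 most recent screenshots
    ]
    content.append({"type": "text", "text": task})
    response = client.chat.completions.create(
        model="opencua-32b",
        messages=[
            {"role": "system", "content": "You are a GUI agent. You are given a task and "
             "screenshots of the screen. Predict the next pyautogui action to complete the task."},
            {"role": "user", "content": content},
        ],
        max_tokens=512,
        temperature=0,
    )
    return response.choices[0].message.content

# One iteration of the loop: capture a screenshot -> predict -> (execute in the environment)
print(predict_next_action("Open the Downloads folder", ["screenshot.png"]))
```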
296
 
297
+ Command for running OpenCUA-32B in OSWorld:
298
  ```
299
  python run_multienv_opencua.py \
300
  --headless \
 
305
  --num_envs 30 \
306
  --coordinate_type qwen25
307
  ```
 
 
 
308
 
309
  ## Important Notes on Coordinate Systems
310
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
311
  <ul style="margin: 0;">
312
+ <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
313
+ <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
314
+ <li><strong><code>OpenCUA/OpenCUA-72B</code></strong> – Absolute coordinates</li>
 
315
  </ul>
316
  </div>
317
 
318
+ **OpenCUA models output absolute coordinates after smart resize:**
319
+
320
+ ```python
321
+ # Example output: pyautogui.click(x=960, y=324)
322
+ # These are coordinates on the smart-resized image, not the original image
323
+
324
+ # Convert to original image coordinates:
325
+ # Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
326
+ def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
327
+ # First, calculate the smart-resized dimensions
328
+ resized_height, resized_width = smart_resize(original_height, original_width, factor = 28, min_pixels = 3136, max_pixels = 12845056)
329
+
330
+ # Convert model output to relative coordinates (fractions of the resized image)
331
+ rel_x = model_x / resized_width
332
+ rel_y = model_y / resized_height
333
+
334
+ # Then convert to absolute coordinates on original image
335
+ abs_x = int(rel_x * original_width)
336
+ abs_y = int(rel_y * original_height)
337
+ return abs_x, abs_y
338
+ ```
339
 
340
  <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
341
  <strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
342
  <p style="margin: 8px 0 0;">
343
+ The Qwen2.5-VL models use a "smart resize" preprocessing that maintains aspect ratio while fitting within pixel constraints.
344
  For coordinate conversion, you need the smart resize function from the
345
  <a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
346
  official Qwen2.5-VL implementation</a>.
347
  </p>
348
  </div>
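For reference, a condensed re-implementation of that resize rule is sketched below, following the rounding logic of the linked code: each side is rounded to a multiple of `factor`, then rescaled if the total pixel count falls outside the allowed range. Treat the official function as authoritative:

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 3136, max_pixels: int = 12845056):
    """Return (resized_height, resized_width) as used by Qwen2.5-VL preprocessing."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

# Example: a 1920x1080 screenshot is resized before the model sees it.
print(smart_resize(1080, 1920))  # -> (1092, 1932)
```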
349
 
350
+ # Acknowledgements
351
+ <p>
352
+ We thank Yu Su, Caiming Xiong, and the anonymous reviewers for their insightful discussions and valuable feedback.
353
+ We are grateful to Moonshot AI for providing training infrastructure and annotated data.
354
+ We also sincerely appreciate Hao Yang, Zhengtao Wang, and Yanxu Chen from the Kimi Team for their strong infrastructure support and helpful guidance.
355
+ We thank Chong Peng, Taofeng Xue, and Qiumian Huang from the <a href="https://github.com/meituan/EvoCUA" target="_blank">Meituan EvoCUA Team</a> for their contributions to vLLM integration.
356
+ The development of our tool builds on the open-source projects <a href="https://github.com/TheDuckAI/DuckTrack" target="_blank">DuckTrack</a> and <a href="https://github.com/OpenAdaptAI/OpenAdapt" target="_blank">OpenAdapt</a>.
357
+ We are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
358
+ </p>
359
 
360
  <div style="text-align:center;">
361
 
 
386
 
387
  ```bibtex
388
  @misc{wang2025opencuaopenfoundationscomputeruse,
389
+ title={OpenCUA: Open Foundations for Computer-Use Agents},
390
  author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
391
  year={2025},
392
  eprint={2508.09123},
393
  archivePrefix={arXiv},
394
  primaryClass={cs.AI},
395
+ url={https://arxiv.org/abs/2508.09123},
396
  }
397
  ```
398
 
399
+ </div>