Unable to reproduce the results reported in your paper
I encountered an issue while trying to reproduce the results of your paper. Testing SWE-bench Lite with OpenHands and Qwen3-Coder-30B, I got 27.7% accuracy, which matches the number reported in the paper. However, when I switched to the EKTO model, the accuracy only increased to 31%, which is significantly lower than your reported 44.7%.
To investigate further, I ran tests using R2E. By default, R2E does not use function calls with Qwen. In my tests, accuracy was 32.7% without function calls and 37.3% with them enabled, which is still considerably lower than the reported result.
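For reference, I toggle function calling via a --fn_calling CLI flag that feeds args.fn_calling in the config below. A minimal sketch of the wiring (the exact argparse setup here is my own and hypothetical):

import argparse

# Hypothetical flag wiring; the names match the config dict below.
parser = argparse.ArgumentParser()
parser.add_argument('--traj_dir', required=True)
parser.add_argument('--exp_name', required=True)
parser.add_argument('--fn_calling', action='store_true',
                    help='Enable native function calling for the agent.')
args = parser.parse_args()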
Below are my vLLM launch command and R2E parameter configuration. Could the discrepancy in accuracy be due to differences in parameter settings?
vllm:
CUDA_VISIBLE_DEVICES=1,2 \
vllm serve Qwen3-Coder-30B-A3B-Instruct \
  --port 4444 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 150000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 2
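As a server-side sanity check, I verified the endpoint responds and serves the expected model name. A minimal sketch, assuming the server above on localhost:4444 and a dummy API key (the openai Python client works because vLLM exposes an OpenAI-compatible API):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is a dummy value.
client = OpenAI(base_url="http://localhost:4444/v1", api_key="EMPTY")

# The served model id must match the llm_name used by the agent config.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)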
collectXX.py:
config = {
    'traj_dir': args.traj_dir,
    'exp_name': args.exp_name,
    'max_workers': 5,
    'max_steps': 200,
    'max_steps_absolute': 200,
    # 'llm_name': 'openai/Qwen/Qwen3-Coder-480B-A35B-Instruct',
    # 'llm_name': 'openai/Qwen/Qwen3-Coder-30B-A3B-Instruct',
    'llm_name': 'openai/qwen3-coder-30b-ekto-merged',
    # 'llm_name': 'openai/Qwen3-Coder-30B-A3B-Instruct',
    # 'llm_name': 'openai/glm-4.5-air',
    # 'llm_name': 'openai/glm-4.5',
    # 'llm_name': 'anthropic/claude-4-sonnet-20250514',
    'use_fn_calling': args.fn_calling,
    'temperature': 0.7,
    'max_tokens': 65536,  # From DeepSWE reproduction guide
    'backend': "docker",
    'max_reward_calc_time': 1200,  # From DeepSWE reproduction guide
    'max_iterations': 1,
    'scaffold': "r2egym",
    'num_restarts': 1,
}
yaml:
command_files:
  - "./src/r2egym/agenthub/tools/r2egym/file_editor.py"
  - "./src/r2egym/agenthub/tools/search.py"
  - "./src/r2egym/agenthub/tools/r2egym/execute_bash.py"
  - "./src/r2egym/agenthub/tools/finish.py"
llm_name: "qwen3-coder-plus-2025-07-22"
demo_file: "./r2egym/agenthub/config/localizer-demo"
llm_base_url: "http://90.254.86.39:3333/v1"
other_args:
  max_retries: 3
  timeout: 3600
  top_p: 0.8
  top_k: 20
  repetition_penalty: 1.05
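One thing worth noting: top_k and repetition_penalty are not part of the standard OpenAI request schema, so when requests go through vLLM's OpenAI-compatible endpoint they must be passed as extra parameters, roughly like this (a sketch using the openai Python client; I have not confirmed how R2E forwards them internally):

from openai import OpenAI

client = OpenAI(base_url="http://90.254.86.39:3333/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen3-coder-plus-2025-07-22",
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    top_p=0.8,
    # vLLM accepts non-standard sampling params via extra_body.
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(resp.choices[0].message.content)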
Hi, thanks for your interest in our work. We used max_tokens=131k in our experiments, and this could be a major factor behind the performance difference. As for the scaffold, the EKTO model was trained with R2E (without function calling, to keep the dataset consistent across models), so you should also expect better performance with the R2E scaffold. It is a bit weird that you achieve better performance (37.3%) with function calling, since the SFT and KTO datasets only contain trajectories without function calling; could you double-check the setting, especially the loaded config? I have looked over the other parameters in your setup, and they should be fine apart from max_tokens.
Thank you for your reply. I had set vLLM's max-model-len, mistakenly thinking it was the same as max_tokens. I've now corrected the setting and will try again.
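For reference, the corrected value I will retry with (a sketch of the single change; 131072 taken from your reply):

# max_tokens is the per-response generation cap the agent sends per request;
# it is distinct from vLLM's --max-model-len, which bounds prompt + output.
config['max_tokens'] = 131072  # 131k, matching the authors' setup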
Thank you very much for your response.