Tested Model:

Due to the computational resources constraint, we support Qwen3-14B first, and we are looking forward to add closed-source models like o3 , gemini-2.5 etc.

All the models are deployed by vllm, different frameworks may have different results.

Baselines:

Prompt Engineering:

Prompt Optimization:

Evaluation Metric: