Tested Model:
Due to limited computational resources, we support Qwen3-14B first, and we plan to add closed-source models such as o3 and gemini-2.5.
All models are deployed with vllm; other serving frameworks may give different results.
Baselines:
Prompt Engineering:
Add "Let's think step by step" in the prompt.
Add "Please first think about the principles involved in solving this task which could be helpful. And Then provide a solution step by step for this question." in the prompt.
Add "Rephrase and expand the question, and respond." in the prompt.
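As a rough illustration of the prompt-engineering baselines listed above, the sketch below attaches one of the three instruction strings to a question. The instruction texts are copied verbatim from the list; everything else is an assumption made for illustration: the dictionary keys, the `build_prompt` helper, and the choice to place the instruction on a new line after the question (the list only says the text goes "in the prompt").

```python
# Instruction strings copied verbatim from the baseline list.
# The dictionary keys are hypothetical labels, not names used by the project.
BASELINE_INSTRUCTIONS = {
    "step_by_step": "Let's think step by step",
    "principles_first": (
        "Please first think about the principles involved in solving this task "
        "which could be helpful. And Then provide a solution step by step for "
        "this question."
    ),
    "rephrase": "Rephrase and expand the question, and respond.",
}


def build_prompt(question: str, baseline: str) -> str:
    """Return the question followed by the chosen baseline instruction.

    Assumption: the instruction is appended after the question on its own
    line; the actual placement used in the experiments may differ.
    """
    instruction = BASELINE_INSTRUCTIONS[baseline]  # KeyError for unknown labels
    return f"{question.strip()}\n{instruction}"


if __name__ == "__main__":
    print(build_prompt("What is 17 * 24?", "step_by_step"))
```

The resulting string can then be sent as the user message to whichever serving setup is in use.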
Prompt Optimization:
Evaluation Metric: