Prompt Optimization Benchmark

Tested Model:

Due to limited computational resources, we currently support only Qwen3-14B. We plan to add closed-source models such as o3 and gemini-2.5.

All models are deployed with vLLM; other serving frameworks may produce different results.

Baselines:

Prompt Engineering:

Direct, the model answers each dataset example directly, with no added instruction.
ZeroCoT, adds "Let's think step by step" to the prompt.
StepBack, adds "Please first think about the principles involved in solving this task which could be helpful. And Then provide a solution step by step for this question." to the prompt.
Rephrase, adds "Rephrase and expand the question, and respond." to the prompt.
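
As a rough illustration of how the four prompt-engineering baselines differ, the sketch below builds one prompt per baseline. The instruction strings are copied from the list above; the helper name `build_prompt`, the dictionary `BASELINE_INSTRUCTIONS`, and the choice to append the instruction after the question on a new line are assumptions made for this example, not necessarily how the benchmark assembles its prompts.

```python
# Instruction text for each baseline, copied from the baseline list.
# "Direct" adds nothing to the question.
BASELINE_INSTRUCTIONS = {
    "Direct": "",
    "ZeroCoT": "Let's think step by step",
    "StepBack": (
        "Please first think about the principles involved in solving this task "
        "which could be helpful. And Then provide a solution step by step for "
        "this question."
    ),
    "Rephrase": "Rephrase and expand the question, and respond.",
}


def build_prompt(question: str, baseline: str) -> str:
    """Return the prompt for one baseline (hypothetical helper).

    Assumption: the instruction is appended after the question,
    separated by a newline.
    """
    instruction = BASELINE_INSTRUCTIONS[baseline]
    if not instruction:
        return question
    return f"{question}\n{instruction}"


if __name__ == "__main__":
    for name in BASELINE_INSTRUCTIONS:
        print(f"--- {name} ---")
        print(build_prompt("How many apples are left?", name))
```

Keeping the instructions in a single mapping makes it easy to run every baseline over the same dataset with one loop.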

Prompt Optimization:

ProTeGi, https://arxiv.org/pdf/2305.03495
TextGrad, https://arxiv.org/pdf/2406.07496
PromptBreeder, https://arxiv.org/pdf/2309.16797

Evaluation Metric:

Liar, Judge whether the model's output (Yes or No) matches the answer.
GSM8K, Judge whether the last digit of the model's output matches the answer.
BBH-Object Counting, Judge whether the last digit of the model's output matches the answer.
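
The matching rules above could be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual scorer: it reads "last digit" as the last number appearing in the output (GSM8K answers are often multi-digit), takes the last standalone "yes" or "no" in the output for Liar, and compares case-insensitively. The function names `last_number`, `numeric_match`, and `yes_no_match` are hypothetical.

```python
import re


def last_number(text):
    """Return the last number in the text as a string, or None if absent.

    Assumption: thousands separators such as "1,234" are removed first.
    """
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return matches[-1] if matches else None


def numeric_match(output, answer):
    """GSM8K / BBH-Object Counting: compare the last number with the answer."""
    predicted = last_number(output)
    if predicted is None:
        return False
    return float(predicted) == float(answer)


def yes_no_match(output, answer):
    """Liar: compare the last standalone yes/no in the output with the answer."""
    found = re.findall(r"\b(yes|no)\b", output.lower())
    if not found:
        return False
    return found[-1] == answer.strip().lower()


if __name__ == "__main__":
    print(numeric_match("5 + 13 = 18. The answer is 18.", "18"))
    print(yes_no_match("The claim is not supported. No", "No"))
```

Taking the last match rather than the first is meant to tolerate outputs that show intermediate reasoning before the final answer.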