Reward Models

Created: by Pradeep Gowda. Updated: Aug 01, 2024. Tagged: llm.

Lambert, Nathan, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, et al. “RewardBench: Evaluating reward models for language modeling,” 2024. Abstract: Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process. https://arxiv.org/abs/2403.13787.
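The accuracy check behind those prompt-win-lose trios is simple: a classifier RM is “correct” on a trio when it scores the chosen response above the rejected one. A minimal sketch of that check, assuming a Hugging Face sequence-classification reward model; the model name and the example trio are illustrative, not the actual RewardBench pipeline:

```python
# Minimal sketch of the RewardBench-style accuracy check on one trio.
# Model name and trio are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example open RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def score(prompt: str, response: str) -> float:
    """Scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# One prompt-win-lose trio; the rejected answer has a subtle, verifiable
# bug, mirroring how the RewardBench comparison sets are built.
trio = {
    "prompt": "Write a Python function that reverses a string.",
    "chosen": "def reverse(s):\n    return s[::-1]",
    "rejected": "def reverse(s):\n    return s[::1]  # bug: does not reverse",
}
won = score(trio["prompt"], trio["chosen"]) > score(trio["prompt"], trio["rejected"])
print("RM prefers the chosen response:", won)  # accuracy = mean over all trios
```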

[Figure: RewardBench reward model evaluation overview]

With RewardBench, we:

  1. Release a common framework for evaluating the many different architectures of reward models, along with tools for visualization, training, and other analysis. We also release all data used in the evaluation, composed of text-score pairs for all inputs, to enable further data analysis on the properties of reward models.
  2. Illustrate the differences between DPO and classifier-based reward models across a variety of datasets. DPO models, while more plentiful due to the method’s simplicity, fail to generalize to popular preference data test sets and show higher variance in performance (see the implicit-reward sketch after this list).
  3. Chart the landscape of current state-of-the-art reward models. We showcase the scaling laws, the propensity to refuse (or not), the reasoning capabilities, and more for popular RMs.
  4. Show the limitations of existing preference data test sets for evaluating these models, showcasing common pitfalls of RMs on subtle, but challenging instruction pairs (e.g. intentionally modified rejected responses, which superficially look high quality but answer the wrong prompt).
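On point 2: a DPO-trained model has no explicit reward head; its reward is implicit in the log-ratio between the policy and the reference model, r(x, y) = β log(π_θ(y|x) / π_ref(y|x)), so RewardBench scores such models by comparing this quantity for the chosen and rejected responses. A minimal sketch, assuming two causal LMs loaded as policy and reference; the β value and the prompt/response tokenization split are simplifying assumptions:

```python
# Minimal sketch of DPO's implicit reward,
#   r(x, y) = beta * (log pi_theta(y | x) - log pi_ref(y | x)).
# For ranking chosen vs. rejected, beta > 0 cancels out, so only the
# difference of log-ratios matters.
import torch
import torch.nn.functional as F

def response_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probs of the response tokens, conditioned on the prompt.
    Assumes prompt + response re-tokenizes with the prompt as a prefix."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)           # predicts token t+1
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1 :].sum().item()          # response tokens only

def implicit_reward(policy, ref, tokenizer, prompt, response, beta=0.1):
    return beta * (
        response_logprob(policy, tokenizer, prompt, response)
        - response_logprob(ref, tokenizer, prompt, response)
    )

# The DPO model "prefers" the chosen response iff
# implicit_reward(..., chosen) > implicit_reward(..., rejected).
```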

Yuan, Weizhe, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. “Self-rewarding language models,” 2024. Abstract: We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes. https://arxiv.org/abs/2401.10020.

Self-Rewarding Language Models

Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, the authors study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. They show that during Iterative DPO training, not only does instruction following ability improve, but so does the model’s ability to provide high-quality rewards to itself.
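The training loop is easy to state: the model generates several candidate responses per prompt, judges them with an LLM-as-a-Judge rubric, turns the best and worst candidates into a preference pair, and the next iteration is trained on those pairs with DPO. A minimal sketch of one iteration; `generate` and `run_dpo` are hypothetical helpers, and the rubric text is a paraphrase of the paper’s additive 5-point scheme, not the verbatim prompt:

```python
import re

# Paraphrase of the paper's additive 5-point LLM-as-a-Judge rubric
# (relevance, coverage, helpfulness, clarity, expert quality) -- not verbatim.
JUDGE_PROMPT = (
    "Review the user's question and the response. Add one point for each "
    "criterion met: relevant, substantially covers the question, helpful, "
    "clearly written from an AI assistant's perspective, expert quality. "
    "End with 'Score: <0-5>'."
)

def parse_score(judgment: str) -> float:
    """Extract the trailing 'Score: <n>' from the judge's reply."""
    found = re.findall(r"Score:\s*(\d+)", judgment)
    return float(found[-1]) if found else 0.0

def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One iteration M_t -> M_{t+1}; generate() and run_dpo() are
    hypothetical helpers standing in for sampling and DPO training."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_candidates)]
        # The same model judges its own outputs (LLM-as-a-Judge).
        scores = [
            parse_score(generate(model, f"{JUDGE_PROMPT}\n\nQ: {prompt}\nA: {c}"))
            for c in candidates
        ]
        best, worst = max(scores), min(scores)
        if best > worst:
            pairs.append({
                "prompt": prompt,
                "chosen": candidates[scores.index(best)],
                "rejected": candidates[scores.index(worst)],
            })
    return run_dpo(model, pairs)  # Iterative DPO on self-labeled pairs
```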

Self-Play Fine-Tuning (SPIN)

According to Das, Sayantan, “Model Alignment Process,” 2024, https://blog.premai.io/model-alignment-process/ (a PremAI blog post that also gives a good overview of RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and IPO (Identity Preference Optimization)), Self-Play Fine-Tuning (SPIN) is a method designed to enhance the capabilities of Large Language Models (LLMs) through a self-improvement process that doesn’t rely on additional human-annotated data (Chen, Zixiang, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. “Self-play fine-tuning converts weak language models to strong language models,” 2024. https://arxiv.org/abs/2401.01335).
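The “self-play” is a game against the model’s own previous iteration: the human-written SFT response plays the “chosen” side, the previous model’s generation plays the “rejected” side, and a DPO-style loss teaches the main player to tell them apart. A minimal sketch of one SPIN round, reusing the hypothetical `generate` and `run_dpo` helpers from the self-rewarding sketch above:

```python
def spin_iteration(model_t, sft_dataset):
    """One SPIN round M_t -> M_{t+1} (Chen et al., 2024), sketched."""
    pairs = []
    for example in sft_dataset:
        prompt, human_response = example["prompt"], example["response"]
        # Opponent move: the previous iteration answers the same prompt.
        synthetic = generate(model_t, prompt)
        pairs.append({
            "prompt": prompt,
            "chosen": human_response,  # ground-truth SFT data
            "rejected": synthetic,     # self-generated, to be discriminated
        })
    # Main-player update: a DPO-style loss pushes the model toward the
    # human data and away from its own previous generations.
    return run_dpo(model_t, pairs)

# Iterate until self-generations become indistinguishable from the SFT
# data, at which point the preference pairs carry no training signal.
```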