![](https://crypto4nerd.com/wp-content/uploads/2023/05/0YRG8j6WCpFTZ6M54-1024x401.png)
Here we introduce a new repository open-sourced by OpenDILab.
Recently, OpenDILab released a paper collection on Reinforcement Learning with Human Feedback (RLHF) and open-sourced it on GitHub. The repository is dedicated to helping researchers collect the latest papers on RLHF, so that they can get to know this area more easily.
About RLHF
Reinforcement Learning with Human Feedback (RLHF) is an extended branch of Reinforcement Learning (RL). When the optimization goal is abstract and a concrete reward function is hard to define, RLHF incorporates human feedback into the training process: the feedback is used to train a reward model, a neural network that provides the reward signals the RL agent learns from. Through this interactive form of learning, human needs, preferences, and perceptions can be conveyed to the agent naturally, aligning the optimization objectives of humans and AI so that the resulting system behaves in a manner consistent with human values.
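To make this pipeline concrete, below is a minimal, self-contained sketch (in PyTorch, not code from the OpenDILab repository) of the first step: fitting a reward model to pairwise human preferences with a Bradley-Terry style loss, as in [1]. The network architecture, feature dimension, and toy data are illustrative assumptions.

```python
# Minimal sketch: train a reward model from pairwise human preferences.
# Everything here (RewardModel shape, feature_dim, toy data) is illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state/prompt + action/response) feature vector to a scalar reward."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per input

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy "human feedback": each pair holds features of a preferred and a rejected
# trajectory segment (in practice these come from annotators comparing outputs).
preferred = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for step in range(100):
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    # Pairwise logistic (Bradley-Terry) loss: push the preferred segment's
    # reward above the rejected segment's reward.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The trained reward model then plays the role of the missing reward function: the RL agent is optimized against its scalar output instead of a hand-written reward.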
Initially, in the 2017 research work “Deep reinforcement learning from human preferences” [1], researchers introduced human feedback into classical academic decision-making settings such as Atari [2] and MuJoCo [3], which led to some interesting findings. This line of work later spawned research subdirections such as preference-based RL and inverse RL [4].
Since 2020, researchers have further found that for Large Language Models (LLMs), RLHF can effectively improve the truthfulness and informativeness of generated text, bridging the gap between the output of LLMs and the conversational information humans actually need [5–6]. In late 2022, ChatGPT [7] was launched, and more than 100 million users experienced the versatility and convenience of this powerful conversational system within just a few months. RLHF successfully brings out the knowledge embedded in LLMs and efficiently aligns AI behavior with human preferences.
Three possible advantages of RLHF are as follows:
Establishing an optimization paradigm: RLHF establishes a new optimization paradigm for decision-making tasks where the reward function cannot be explicitly defined, offering a feasible and more efficient interactive training solution for machine learning tasks that require human preference guidance (see the sketch after this list).
Data-Efficient: Compared with other training methods, such as supervised learning, Top-K sampling, etc., RLHF can achieve comparable training results with less human feedback data.
Parameter-Efficient: Compared with other training methods, such as supervised learning, Top-K sampling, etc., RLHF allows neural networks with a small number of parameters to perform powerfully.
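The sketch below (again an illustrative assumption rather than the repository's code) shows the other half of the loop mentioned in the first advantage: optimizing a policy against the learned reward model, here with a simple REINFORCE-style update and a KL penalty toward a frozen reference policy, a common recipe when RLHF is applied to LLMs [5]; production systems typically use PPO instead.

```python
# Minimal sketch: optimize a toy policy against a learned reward with a KL
# penalty toward a frozen reference policy. All shapes and hyper-parameters
# are illustrative; learned_reward() stands in for the reward model above.
import torch
import torch.nn as nn

vocab_size, hidden = 8, 32
policy = nn.Linear(hidden, vocab_size)       # trainable policy head
reference = nn.Linear(hidden, vocab_size)    # frozen reference (e.g. the SFT model)
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
kl_coef = 0.1  # strength of the KL penalty

def learned_reward(actions: torch.Tensor) -> torch.Tensor:
    # Stand-in for the trained reward model; here just a toy scoring rule.
    return (actions == 3).float()

for step in range(200):
    states = torch.randn(64, hidden)                   # toy "prompts"
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()                            # toy "responses"

    with torch.no_grad():
        ref_dist = torch.distributions.Categorical(logits=reference(states))

    # Reward from the learned model, shaped by a per-sample KL penalty that
    # keeps the policy close to the reference.
    reward = learned_reward(actions) - kl_coef * (
        dist.log_prob(actions) - ref_dist.log_prob(actions)
    )

    # Simple REINFORCE update (real RLHF systems typically use PPO).
    loss = -(dist.log_prob(actions) * reward.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```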
Selected Papers
Title: WebGPT: Browser-assisted question-answering with human feedback
Authors: Reiichiro Nakano, Jacob Hilton, Suchir Balaji, et al.
Keywords: Model searches the web and provides references, Imitation learning, BC, Long-form question answering
Title: Recursively Summarizing Books with Human Feedback
Authors: Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano
Keywords: Model trained on small tasks to assist humans in evaluating broader tasks, BC
Title: Revisiting the Weaknesses of Reinforcement Learning for Neural Machine Translation
Authors: Samuel Kiegeland, Julia Kreutzer
Keywords: The success of policy gradient comes from the reward rather than the shape of the output distribution, Machine Translation, NMT, Domain Adaptation
Title: Learning to summarize from human feedback
Authors: Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
Keywords: Care about summary quality, Training loss affects model behavior, Reward model generalizes to new datasets
Title: Fine-Tuning Language Models from Human Preferences
Authors: Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving
Keywords: Reward learning for language, Continuing text with positive sentiment, Summarization task, Physically descriptive language
Title: Scalable agent alignment via reward modeling: a research direction
Authors: Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
Keywords: Agent alignment problem, Learn reward from interaction, Optimize reward with RL, Recursive reward modeling
Title: Reward learning from human preferences and demonstrations in Atari
Authors: Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei
Keywords: Expert demonstrations, Trajectory preferences, Reward hacking problem, Noise in human labels
Title: Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
Authors: Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, Peter Stone
Keywords: High-dimensional state spaces, Leveraging input from a human trainer
Title: Deep reinforcement learning from human preferences
Authors: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
Keywords: Goals defined in terms of human preferences between pairs of trajectory segments, Learning behaviors more complex than those previously learned from human feedback
Title: Interactive Learning from Policy-Dependent Human Feedback
Authors: James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David Roberts, Matthew E. Taylor, Michael L. Littman
Keywords: Human feedback is influenced by the current policy, Learning from policy-dependent feedback that converges to a local optimum
References
[1] Christiano P F, Leike J, Brown T, et al. Deep reinforcement learning from human preferences[J]. Advances in neural information processing systems, 2017, 30.
[2] https://gymnasium.farama.org/environments/atari/
[3] https://gymnasium.farama.org/environments/mujoco/
[4] Brown D, Goo W, Nagarajan P, et al. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations[C]//International conference on machine learning. PMLR, 2019: 783–792.
[5] Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[J]. arXiv preprint arXiv:2203.02155, 2022.
[6] Ramamurthy R, Ammanabrolu P, Brantley K, et al. Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization[J]. arXiv preprint arXiv:2210.01241, 2022.
[7] https://openai.com/blog/chatgpt