Language Models

The Role of Human-in-the-Loop Preferences in Reward Function Learning for Humanoid Tasks | HackerNoon

A.6 Human-in-the-Loop Preference (A.6.1 IsaacGym Tasks): We evaluate human-in-the-loop preference experiments on tasks in IsaacGym, including Quadcopter, Humanoid, Ant, ShadowHand, and AllegroHand. In these experiments, volunteers provided feedback only by comparing videos showcasing the final policies derived … (A minimal sketch of such a pairwise video-comparison loop follows this teaser.)

Read More »
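The protocol described above reduces human feedback to pairwise comparisons of videos of final policies. The snippet below is a minimal sketch of that comparison loop, not the authors' code: the video file names, the console prompt, and the `collect_pairwise_preferences` helper are illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of collecting pairwise
# human preferences over videos of final policies. File names and the console
# prompt are hypothetical.
from itertools import combinations

def collect_pairwise_preferences(video_paths):
    """Ask a volunteer to compare every pair of policy videos.

    Returns a list of (index_a, index_b, preferred_index) tuples that can
    later serve as preference labels.
    """
    preferences = []
    for a, b in combinations(range(len(video_paths)), 2):
        print(f"Watch {video_paths[a]} (option 0) and {video_paths[b]} (option 1).")
        choice = input("Which policy behaves better? [0/1]: ").strip()
        winner = a if choice == "0" else b
        preferences.append((a, b, winner))
    return preferences

# Example: videos rendered from five final policies (hypothetical file names).
labels = collect_pairwise_preferences([f"humanoid_policy_{i}.mp4" for i in range(5)])
```

Keeping the interface to a binary choice per pair keeps the volunteer's task simple and produces labels in the same form that preference-based reward learning expects.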

Tracking Reward Function Improvement with Proxy Human Preferences in ICPL | HackerNoon

A.5 Proxy Human Preference (A.5.1 Additional Results): Due to the high variance in LLM performance, we report the standard deviation across 5 experiments as a supplement, presented in Table 5 and Table 6. We also … (A short example of this mean-and-standard-deviation reporting follows this teaser.)

Read More »
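The excerpt above reports each metric as a mean with a standard deviation over 5 repeated experiments. The snippet below sketches only that reporting convention; the task names and scores are placeholders, not values from the paper, and the use of the sample standard deviation is my assumption.

```python
import numpy as np

# Placeholder scores: 5 repeated runs per task (NOT results from the paper).
task_scores = {
    "Humanoid": [5.1, 4.8, 5.4, 4.9, 5.2],
    "Ant": [9.0, 8.7, 9.3, 8.9, 9.1],
}

for task, scores in task_scores.items():
    scores = np.asarray(scores)
    # Report mean ± sample standard deviation across the 5 experiments.
    print(f"{task}: {scores.mean():.2f} ± {scores.std(ddof=1):.2f} (n={scores.size})")
```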
Software

Few-shot In-Context Preference Learning Using Large Language Models: Environment Details | HackerNoon

A. Appendix: In Table 4, we present the observation and action dimensions, along with the task descriptions and task metrics, for the 9 tasks in IsaacGym. Authors: (1) Chao Yu, Tsinghua University; (2) Hong Lu, Tsinghua University; (3) Jiaxuan Gao, Tsinghua University; (4) Qixin Tan, Tsinghua University; (5) Xinting Yang, Tsinghua University; (6) Yu Wang, Tsinghua University (equal advising); (7) Yi Wu, Tsinghua University and Shanghai Qi Zhi Institute (equal advising).

Read More »

ICPL Baseline Methods: Disagreement Sampling and PrefPPO for Reward Learning | HackerNoon

A.3 Baseline Details: To sample trajectories for reward learning, we employ the disagreement sampling scheme from Lee et al. (2021b) to enhance the training process. This scheme first generates a larger batch of trajectory pairs uniformly at … (A sketch of this disagreement-based selection follows this teaser.)

Read More »
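The baseline excerpt above describes disagreement sampling: draw a large batch of trajectory pairs uniformly at random, then keep the pairs on which an ensemble of reward models disagrees most about which trajectory is preferred. The sketch below is one plausible reading of that scheme, not the paper's implementation; the Bradley-Terry style preference probability and the stand-in reward models are assumptions.

```python
# A minimal sketch of disagreement-based pair selection: uniformly sample
# candidate trajectory pairs, score each pair by the spread of ensemble
# preference predictions, and keep the most ambiguous pairs for labeling.
import numpy as np

def disagreement_sampling(segments, reward_models, num_draws=1000,
                          num_selected=100, rng=None):
    """Return index pairs with the highest ensemble disagreement."""
    rng = rng or np.random.default_rng(0)
    n = len(segments)
    pairs = rng.integers(0, n, size=(num_draws, 2))  # uniform candidate pairs

    disagreement = []
    for i, j in pairs:
        # Each model scores both segments; a Bradley-Terry style sigmoid of the
        # score difference gives its predicted probability that i is preferred.
        returns_i = np.array([m(segments[i]) for m in reward_models])
        returns_j = np.array([m(segments[j]) for m in reward_models])
        probs = 1.0 / (1.0 + np.exp(-(returns_i - returns_j)))
        disagreement.append(probs.std())  # spread across the ensemble

    top = np.argsort(disagreement)[-num_selected:]
    return pairs[top]

# Usage with toy stand-ins: segments are feature vectors, "models" are linear scorers.
segments = [np.random.default_rng(s).normal(size=8) for s in range(50)]
models = [lambda x, w=np.random.default_rng(k).normal(size=8): float(w @ x)
          for k in range(3)]
selected_pairs = disagreement_sampling(segments, models, num_draws=200, num_selected=10)
```

Selecting high-disagreement pairs focuses the limited preference budget on the comparisons the current reward ensemble finds most ambiguous, which is the intent of the sampling scheme named in the excerpt.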