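The paper "Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts" proposes ArmoRM-Llama3-8B-v0.1, a reward model that can serve as a scoring model. It evaluates LLM-generated synthetic data along multiple interpretable dimensions (such as honesty, verbosity, and safety) and uses a Mixture-of-Experts architecture to make the scoring logic more transparent and more accurate. The model performs strongly on the RewardBench benchmark, approaching the level of a GPT-4 judge, and it also helps guard against hidden risks such as reward hacking.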
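The RLHF-Reward-Modeling example code and the ArmoRM-Llama3-8B-v0.1 model are both available for download. What follows is a summary of the key points and a practical usage guide.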
Architecture overview

ArmoRM-Llama3-8B-v0.1 is used as the evaluation model.
Scoring method
Input
Prompt: the question given to the model
'What are some synonyms for the word "beautiful"?'
Response: the generated answer
"Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages: the prompt and response arranged in chat format
messages = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
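The input format above can be sketched as a small helper plus a scoring function. This is a minimal sketch, not the authors' reference script: `build_messages` and `score_messages` are hypothetical names, the model ID `RLHFlow/ArmoRM-Llama3-8B-v0.1` and the `rewards` output field are assumptions based on how the model is published on Hugging Face with custom code, and a CUDA GPU with enough memory is assumed for the scoring step.

```python
MODEL_PATH = "RLHFlow/ArmoRM-Llama3-8B-v0.1"  # assumed Hugging Face model ID


def build_messages(prompt: str, response: str) -> list:
    """Pair one prompt with one response in the chat format shown above."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


def score_messages(messages, model_path=MODEL_PATH, device="cuda"):
    """Return the raw per-objective reward vector as a list of floats.

    Heavy imports live inside the function so build_messages can be used
    without torch or transformers installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # trust_remote_code is needed because the model ships its own model class.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_path,
        device_map=device,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model(input_ids)
    # Assumption: the custom model output exposes the multi-objective
    # scores under the attribute `rewards`, with shape (batch, objectives).
    return output.rewards[0].cpu().float().tolist()


prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = build_messages(prompt, response)
```

Calling `score_messages(messages)` downloads and loads the 8B model, so in practice the model and tokenizer would be loaded once and reused across many prompt/response pairs.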
Output
The model scores each response on 19 objectives:
'helpsteer-helpfulness','helpsteer-correctness','helpsteer-coherence',
'helpsteer-complexity','helpsteer-verbosity','ultrafeedback-overall_score',
'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
'ultrafeedback-honesty','ultrafeedback-helpfulness','beavertails-is_safe',
'prometheus-score','argilla-overall_quality','argilla-judge_lm','code-complexity',
'code-style','code-explanation','code-instruction-following','code-readability'
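To make the raw output easier to read, the 19 objective names can be attached to the score vector. The sketch below assumes the model returns the scores in exactly the order listed above; `OBJECTIVES` and `name_scores` are hypothetical names introduced for illustration.

```python
# Objective names in the order listed above (assumed to match the output order).
OBJECTIVES = [
    "helpsteer-helpfulness", "helpsteer-correctness", "helpsteer-coherence",
    "helpsteer-complexity", "helpsteer-verbosity", "ultrafeedback-overall_score",
    "ultrafeedback-instruction_following", "ultrafeedback-truthfulness",
    "ultrafeedback-honesty", "ultrafeedback-helpfulness", "beavertails-is_safe",
    "prometheus-score", "argilla-overall_quality", "argilla-judge_lm",
    "code-complexity", "code-style", "code-explanation",
    "code-instruction-following", "code-readability",
]


def name_scores(rewards):
    """Map a 19-element score vector to a {objective name: score} dict."""
    if len(rewards) != len(OBJECTIVES):
        raise ValueError(f"expected {len(OBJECTIVES)} scores, got {len(rewards)}")
    return dict(zip(OBJECTIVES, rewards))


# Placeholder values only, used to show the shape of the result.
named = name_scores([float(i) for i in range(19)])
print(named["beavertails-is_safe"])
```

Raising on a length mismatch is a deliberate choice here: silently zipping a shorter vector would mislabel scores without any warning.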
These include the five attributes of the HelpSteer dataset:
helpfulness, correctness, coherence, complexity, verbosity
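HelpSteer labels are integers from 0 to 4, while the model's raw outputs are on a different scale. As far as I recall, the published usage example recovers the HelpSteer scale by multiplying the first five raw scores by 5 and subtracting 0.5. Treat that transform as an assumption to verify against the model card. `to_helpsteer_scale` is a hypothetical helper name.

```python
HELPSTEER_ATTRIBUTES = [
    "helpfulness", "correctness", "coherence", "complexity", "verbosity",
]


def to_helpsteer_scale(rewards):
    """Rescale the first five raw scores to the 0-4 HelpSteer range.

    Assumption: the first five entries of the reward vector are the
    HelpSteer objectives, and raw * 5 - 0.5 recovers the original scale.
    """
    return {
        name: rewards[i] * 5 - 0.5
        for i, name in enumerate(HELPSTEER_ATTRIBUTES)
    }


# Placeholder raw scores, chosen only to illustrate the arithmetic.
scaled = to_helpsteer_scale([0.7, 0.1] + [0.5] * 17)
print(scaled)
```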
Example
A sample drawn at random from the HelpSteer dataset is used as the example.
Example input
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
Example output
# [helpfulness, correctness, coherence, complexity, verbosity]
Label : [3,3,4,2,2]
Model : [2.7812, 2.8398, 3.4844, 1.3945, 1.3262]
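One way to read the numbers above is to compare the model's scores with the human labels attribute by attribute. The following sketch uses only the values from this example. The `compare` helper and the choice of absolute error and rounding as comparison measures are illustrative assumptions, not part of the original evaluation.

```python
ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]


def compare(labels, predictions):
    """Return per-attribute comparison rows and the mean absolute error."""
    rows = []
    for name, label, pred in zip(ATTRIBUTES, labels, predictions):
        rows.append({
            "attribute": name,
            "label": label,
            "model": pred,
            "abs_error": abs(label - pred),
            # True when the model score rounds to the human label.
            "rounded_match": round(pred) == label,
        })
    mae = sum(row["abs_error"] for row in rows) / len(rows)
    return rows, mae


labels = [3, 3, 4, 2, 2]
predictions = [2.7812, 2.8398, 3.4844, 1.3945, 1.3262]

rows, mae = compare(labels, predictions)
for row in rows:
    print(f'{row["attribute"]:<12} label={row["label"]} model={row["model"]:.4f} '
          f'abs_error={row["abs_error"]:.4f} match={row["rounded_match"]}')
print(f"mean absolute error: {mae:.4f}")
```

For this sample the model scores sit slightly below the labels on every attribute, with a mean absolute error of about 0.43. Helpfulness and correctness round to the labeled value, while coherence, complexity, and verbosity round one point lower.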
Example list
