docs(reward): add reward customization guide with critic-like and generative examples #1007

Zijun9 wants to merge 1 commit into inclusionAI:main from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the flexibility of reward function definition in AReaL by introducing a new module for model-based reward functions and a clear example of its application. It also adds comprehensive bilingual documentation that guides users through reward paradigms ranging from simple rule-based methods to generative models, helping users tailor reward signals for diverse reinforcement learning tasks.

Highlights
Changelog
Activity
Code Review
This pull request introduces a valuable new feature: a critic-like reward model function, complete with a training example and comprehensive bilingual documentation. A critical security vulnerability was identified in `reward_model_reward_fn`: an untrusted model path taken from the dataset is used to load a model, potentially leading to remote code execution (RCE). It is recommended to prioritize environment variables or explicit configuration for the model path and to avoid loading from arbitrary paths supplied by the dataset. Further feedback focuses on the robustness of the new reward function and on aligning the documentation examples with best practices: making model loading thread-safe, allowing a configurable tokenizer `max_length`, and improving error handling and device management in the documentation's code examples.
areal/reward/reward_model.py
Outdated
```python
model_path = kwargs.get(
    "reward_model_path", os.environ.get("REWARD_MODEL_PATH", "")
)
```
The reward_model_reward_fn function retrieves the reward_model_path from the kwargs dictionary, which contains all columns from the dataset. This path is then passed to AutoModelForSequenceClassification.from_pretrained, which can load and execute arbitrary code if the model is in a malicious format (e.g., a PyTorch pickle file). Since the dataset is often sourced from external, untrusted locations, an attacker could include a malicious reward_model_path in the dataset to achieve Remote Code Execution (RCE) on the machine running the training. At a minimum, the environment variable REWARD_MODEL_PATH should take precedence over the dataset field to prevent accidental or malicious overrides. Ideally, the model path should only be configurable via environment variables or explicit configuration, and not from the dataset itself.
```python
model_path = os.environ.get("REWARD_MODEL_PATH", "")
```
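A minimal sketch of the precedence rule the review recommends, where the environment variable always wins over any dataset-provided field (the helper name `resolve_reward_model_path` is hypothetical, not part of the PR):

```python
import os


def resolve_reward_model_path(kwargs):
    """Resolve the reward model path with environment-variable precedence.

    REWARD_MODEL_PATH (trusted, set by the operator) takes priority over
    any ``reward_model_path`` field carried in the dataset row, so an
    untrusted dataset cannot redirect model loading.
    """
    path = os.environ.get("REWARD_MODEL_PATH") or kwargs.get("reward_model_path", "")
    if not path:
        raise ValueError("reward model path is not configured")
    return path
```

With this ordering, a malicious `reward_model_path` column in the dataset is ignored whenever the operator has set `REWARD_MODEL_PATH`.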
areal/reward/reward_model.py
Outdated
```python
    or a HuggingFace Hub model identifier.
    """
    global _model, _tokenizer, _device
    if _model is not None:
```
This lazy initialization of the reward model is not thread-safe. In a multi-threaded environment, multiple threads could pass the _model is not None check simultaneously, leading to a race condition where the model is loaded multiple times. To make this more robust, consider using a threading.Lock with a double-checked locking pattern to ensure the check-and-load operation is atomic.
areal/reward/reward_model.py
Outdated
```python
    text,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
```
The max_length is hardcoded to 2048. This might not be optimal for all reward models, as some may support longer or require shorter contexts. It would be more flexible to make this value configurable, for instance by allowing it to be passed via **kwargs.
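One way to make the limit configurable while staying robust to bad input, sketched as a hypothetical helper (the field name `reward_model_max_length` follows the review's suggestion; the clamping bounds are assumptions):

```python
def resolve_max_length(kwargs, default=2048, hard_cap=8192):
    """Read a per-run max_length override from kwargs, with clamping.

    Falls back to the default on missing or malformed values, and caps
    the result so a bad dataset field cannot request an absurd context.
    """
    try:
        value = int(kwargs.get("reward_model_max_length", default))
    except (TypeError, ValueError):
        return default
    return max(1, min(value, hard_cap))
```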
```python
max_length=kwargs.get("reward_model_max_length", 2048),
```

```python
def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("my-org/my-reward-model")
        _model = AutoModelForSequenceClassification.from_pretrained(
            "my-org/my-reward-model", torch_dtype=torch.bfloat16
        ).cuda().eval()

    inputs = _tokenizer(
        prompt + completions,
        return_tensors="pt",
        truncation=True,
        max_length=2048,
    ).to("cuda")

    with torch.no_grad():
        score = _model(**inputs).logits.squeeze().float().item()
    return score
```
The code example for a custom model-based reward could be improved to better reflect the best practices mentioned in the 'Key points' section and the implementation in areal/reward/reward_model.py. The current example lacks:

- `try...except` for error handling.
- Dynamic device placement (`cuda` or `cpu`) instead of hardcoding `.cuda()`.
- A mention of thread-safe lazy loading.

Updating the example to include these would make it a more robust template for users.
```python
def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer, _device
    # Lazy load and cache the model and tokenizer
    if _model is None:
        # For production use, consider adding a lock for thread-safe initialization
        model_path = "my-org/my-reward-model"
        _tokenizer = AutoTokenizer.from_pretrained(model_path)
        _model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        _device = "cuda" if torch.cuda.is_available() else "cpu"
        _model.to(_device).eval()
    try:
        inputs = _tokenizer(
            prompt + completions,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(_device)
        with torch.no_grad():
            score = _model(**inputs).logits.squeeze().float().item()
        return score
    except Exception as e:
        # It's good practice to log the error and return a default reward
        # to prevent the training loop from crashing.
        print(f"Error computing reward: {e}")
        return 0.0
```
docs(reward): add reward customization guide with critic-like and generative examples

Add bilingual (EN/ZH) reward customization guide covering:

- Rule-based rewards (exact match, math verification)
- Model-based rewards (critic-like with AutoModelForSequenceClassification)
- Generative rewards (LLM-as-judge, referencing tongyi_deepresearch)
- AsyncRewardWrapper usage
- Training integration patterns

All examples reference existing code in the repository.
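As a minimal illustration of the rule-based paradigm listed above, a sketch of an exact-match reward function (the signature mirrors the guide's model-based example; the `answer` dataset field is an assumption, not part of the PR):

```python
def exact_match_reward(prompt, completions, prompt_ids, completion_ids, **kwargs):
    """Rule-based reward: 1.0 if the completion exactly matches the
    reference answer carried in the dataset row, else 0.0."""
    reference = str(kwargs.get("answer", "")).strip()
    return 1.0 if completions.strip() == reference else 0.0
```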
Force-pushed from db30879 to b987ed1
Description
Add bilingual (EN/ZH) reward customization guide covering three reward paradigms, all referencing existing code in the repository:

- Rule-based rewards (`gsm8k.py`, `geometry3k.py`)
- Model-based rewards with `AutoModelForSequenceClassification` models (trained via `examples/alignment/`)
- Generative rewards (referencing `examples/search_agent/tongyi_deepresearch/`)

Also covers `AsyncRewardWrapper` usage and training integration patterns.

Related Issue
Relates to 2026 Q1 Roadmap — Example of using a generative or critic-like reward model
Type of Change

- Documentation update

Checklist

- Docs built (`jb build docs`) and `/gemini review` run

Breaking Change Details (if applicable):
N/A
Additional Context
Files changed:

- `docs/en/customization/reward.md` — New English guide
- `docs/zh/customization/reward.md` — Chinese translation
- `docs/en/_toc.yml` / `docs/zh/_toc.yml` — Register new page

All examples reference existing code (`gsm8k.py`, `alignment/`, `tongyi_deepresearch/`). Docs built successfully with `./docs/build_all.sh`.

Need help? Check the Contributing Guide or ask in GitHub Discussions!