docs(reward): add reward customization guide with critic-like and generative examples #1007

Zijun9 wants to merge 1 commit into inclusionAI:main from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the flexibility of reward function definition in AReaL by introducing a new module for model-based reward functions and a clear example of its application. It also adds comprehensive bilingual documentation that guides users through reward paradigms ranging from simple rule-based methods to generative models, helping users tailor reward signals for diverse reinforcement learning tasks.

Highlights
Changelog
Activity
Code Review
This pull request introduces a valuable new feature: a critic-like reward model function, complete with a training example and comprehensive bilingual documentation. A critical security vulnerability was identified in `reward_model_reward_fn`: an untrusted model path taken from the dataset is used to load a model, potentially leading to remote code execution (RCE). It is recommended to prioritize environment variables or explicit configuration for the model path and to avoid loading from arbitrary paths supplied by the dataset. Further feedback focuses on the robustness of the new reward function and on aligning the documentation examples with best practices: making model loading thread-safe, allowing a configurable tokenizer `max_length`, and improving error handling and device management in the documentation's code examples.
areal/reward/reward_model.py
Outdated
```python
model_path = kwargs.get(
    "reward_model_path", os.environ.get("REWARD_MODEL_PATH", "")
)
```
The reward_model_reward_fn function retrieves the reward_model_path from the kwargs dictionary, which contains all columns from the dataset. This path is then passed to AutoModelForSequenceClassification.from_pretrained, which can load and execute arbitrary code if the model is in a malicious format (e.g., a PyTorch pickle file). Since the dataset is often sourced from external, untrusted locations, an attacker could include a malicious reward_model_path in the dataset to achieve Remote Code Execution (RCE) on the machine running the training. At a minimum, the environment variable REWARD_MODEL_PATH should take precedence over the dataset field to prevent accidental or malicious overrides. Ideally, the model path should only be configurable via environment variables or explicit configuration, and not from the dataset itself.
```python
model_path = os.environ.get("REWARD_MODEL_PATH", "")
```
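A minimal sketch of the precedence rule the review recommends, where the environment variable always wins over any dataset-provided field (the helper name `resolve_reward_model_path` is hypothetical, not part of the PR):

```python
import os


def resolve_reward_model_path(kwargs):
    """Resolve the reward model path with environment-variable precedence.

    REWARD_MODEL_PATH (trusted, set by the operator) takes priority over
    any ``reward_model_path`` field carried in the dataset row, so an
    untrusted dataset cannot redirect model loading.
    """
    path = os.environ.get("REWARD_MODEL_PATH") or kwargs.get("reward_model_path", "")
    if not path:
        raise ValueError("reward model path is not configured")
    return path
```

With this ordering, a malicious `reward_model_path` column in the dataset is ignored whenever the operator has set `REWARD_MODEL_PATH`.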
areal/reward/reward_model.py
Outdated
```python
    or a HuggingFace Hub model identifier.
    """
    global _model, _tokenizer, _device
    if _model is not None:
```
This lazy initialization of the reward model is not thread-safe. In a multi-threaded environment, multiple threads could pass the _model is not None check simultaneously, leading to a race condition where the model is loaded multiple times. To make this more robust, consider using a threading.Lock with a double-checked locking pattern to ensure the check-and-load operation is atomic.
areal/reward/reward_model.py
Outdated
```python
    text,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
```
The max_length is hardcoded to 2048. This might not be optimal for all reward models, as some may support longer or require shorter contexts. It would be more flexible to make this value configurable, for instance by allowing it to be passed via **kwargs.
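One way to make the limit configurable while staying robust to bad input, sketched as a hypothetical helper (the field name `reward_model_max_length` follows the review's suggestion; the clamping bounds are assumptions):

```python
def resolve_max_length(kwargs, default=2048, hard_cap=8192):
    """Read a per-run max_length override from kwargs, with clamping.

    Falls back to the default on missing or malformed values, and caps
    the result so a bad dataset field cannot request an absurd context.
    """
    try:
        value = int(kwargs.get("reward_model_max_length", default))
    except (TypeError, ValueError):
        return default
    return max(1, min(value, hard_cap))
```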
```python
max_length=kwargs.get("reward_model_max_length", 2048),
```

```python
def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("my-org/my-reward-model")
        _model = AutoModelForSequenceClassification.from_pretrained(
            "my-org/my-reward-model", torch_dtype=torch.bfloat16
        ).cuda().eval()

    inputs = _tokenizer(
        prompt + completions,
        return_tensors="pt",
        truncation=True,
        max_length=2048,
    ).to("cuda")

    with torch.no_grad():
        score = _model(**inputs).logits.squeeze().float().item()
    return score
```
The code example for a custom model-based reward could be improved to better reflect the best practices mentioned in the 'Key points' section and the implementation in areal/reward/reward_model.py. The current example lacks:

- `try...except` for error handling.
- Dynamic device placement (`cuda` or `cpu`) instead of hardcoding `.cuda()`.
- A mention of thread-safe lazy loading.

Updating the example to include these would make it a more robust template for users.
```python
def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer, _device
    # Lazy load and cache the model and tokenizer
    if _model is None:
        # For production use, consider adding a lock for thread-safe initialization
        model_path = "my-org/my-reward-model"
        _tokenizer = AutoTokenizer.from_pretrained(model_path)
        _model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        _device = "cuda" if torch.cuda.is_available() else "cpu"
        _model.to(_device).eval()
    try:
        inputs = _tokenizer(
            prompt + completions,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(_device)
        with torch.no_grad():
            score = _model(**inputs).logits.squeeze().float().item()
        return score
    except Exception as e:
        # It's good practice to log the error and return a default reward
        # to prevent the training loop from crashing.
        print(f"Error computing reward: {e}")
        return 0.0
```
docs(reward): add reward customization guide with critic-like and generative examples

Add bilingual (EN/ZH) reward customization guide covering:

- Rule-based rewards (exact match, math verification)
- Model-based rewards (critic-like with AutoModelForSequenceClassification)
- Generative rewards (LLM-as-judge, referencing tongyi_deepresearch)
- AsyncRewardWrapper usage
- Training integration patterns

All examples reference existing code in the repository.
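As a minimal illustration of the rule-based paradigm listed above, a sketch of an exact-match reward function (the signature mirrors the guide's model-based example; the `answer` dataset field is an assumption, not part of the PR):

```python
def exact_match_reward(prompt, completions, prompt_ids, completion_ids, **kwargs):
    """Rule-based reward: 1.0 if the completion exactly matches the
    reference answer carried in the dataset row, else 0.0."""
    reference = str(kwargs.get("answer", "")).strip()
    return 1.0 if completions.strip() == reference else 0.0
```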
Force-pushed from db30879 to b987ed1
Description
Add bilingual (EN/ZH) reward customization guide covering three reward paradigms, all referencing existing code in the repository:

- Rule-based rewards (`gsm8k.py`, `geometry3k.py`)
- Model-based rewards with `AutoModelForSequenceClassification` models (trained via `examples/alignment/`)
- Generative rewards (referencing `examples/search_agent/tongyi_deepresearch/`)

Also covers `AsyncRewardWrapper` usage and training integration patterns.

Related Issue
Relates to 2026 Q1 Roadmap — Example of using a generative or critic-like reward model
Type of Change

- Documentation update

Checklist

- Docs built (`jb build docs`) and `/gemini review` run

Breaking Change Details (if applicable):
N/A
Additional Context
Files changed:

- `docs/en/customization/reward.md` — New English guide
- `docs/zh/customization/reward.md` — Chinese translation
- `docs/en/_toc.yml` / `docs/zh/_toc.yml` — Register new page

All examples reference existing code (`gsm8k.py`, `alignment/`, `tongyi_deepresearch/`). Docs built successfully with `./docs/build_all.sh`.

Need help? Check the Contributing Guide or ask in GitHub Discussions!