docs(reward): add reward customization guide with critic-like and generative examples#1007

Open
Zijun9 wants to merge 1 commit into inclusionAI:main from Zijun9:feat/reward-model-example

Conversation

Contributor

@Zijun9 Zijun9 commented Mar 7, 2026

Description

Add bilingual (EN/ZH) reward customization guide covering three reward paradigms,
all referencing existing code in the repository:

  • Rule-based: exact match, math verification (gsm8k.py, geometry3k.py)
  • Model-based (critic-like): using pretrained AutoModelForSequenceClassification
    models (trained via examples/alignment/)
  • Generative (LLM-as-judge): using a separate inference engine as judge
    (referencing examples/search_agent/tongyi_deepresearch/)
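
For orientation, the rule-based paradigm reduces to a plain function over the completion. A minimal illustration in the GSM8K style (the simplified signature below is an assumption for this sketch, not the repository's gsm8k.py):

```python
import re

def exact_match_reward(prompt, completion, answer, **kwargs):
    """Rule-based reward: 1.0 if the last number in the completion
    matches the reference answer, else 0.0 (GSM8K-style check)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == str(answer) else 0.0
```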

Also covers AsyncRewardWrapper usage and training integration patterns.
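
AsyncRewardWrapper itself is AReaL-specific; as a generic sketch of the underlying pattern (not the library's actual API), a blocking reward function can be made awaitable so it does not stall the event loop driving rollout generation:

```python
import asyncio

def slow_reward_fn(prompt, completion, **kwargs):
    # Stand-in for a blocking reward computation (e.g. model inference).
    return 1.0 if "42" in completion else 0.0

async def async_reward(prompt, completion, **kwargs):
    # Offload the blocking call to a worker thread; the event loop
    # stays free to schedule other rollouts meanwhile.
    return await asyncio.to_thread(slow_reward_fn, prompt, completion, **kwargs)
```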

Related Issue

Relates to 2026 Q1 Roadmap — Example of using a generative or critic-like reward model

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

Files changed:

  • docs/en/customization/reward.md — New English guide
  • docs/zh/customization/reward.md — Chinese translation
  • docs/en/_toc.yml / docs/zh/_toc.yml — Register new page

All examples reference existing code (gsm8k.py, alignment/, tongyi_deepresearch/).
Docs built successfully with ./docs/build_all.sh.


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a model-based reward function module to AReaL with a clear example of its use, plus comprehensive bilingual documentation that walks users through reward paradigms from simple rule-based checks to generative judge models, making it easier to tailor reward signals for diverse reinforcement learning tasks.

Highlights

  • New Reward Model Functionality: Introduced areal/reward/reward_model.py, providing a critic-like reward function that leverages pretrained AutoModelForSequenceClassification models for scoring LLM completions. This fills a gap by allowing the use of learned reward signals in RL training.
  • Example Training Script: Added examples/math/gsm8k_reward_model.py, an example script demonstrating how to use the new reward model function for RL training on the GSM8K dataset, replacing traditional rule-based rewards.
  • Comprehensive Reward Customization Guide: Created bilingual documentation (docs/{en,zh}/customization/reward.md) that thoroughly explains how to implement custom reward functions. The guide covers three paradigms: rule-based, model-based (critic-like), and generative (LLM-as-judge), with references to existing code examples.
  • Documentation Integration: Updated the English and Chinese tables of contents (docs/{en,zh}/_toc.yml) to include the new reward customization guide, making it easily discoverable.
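
Since the generative (LLM-as-judge) paradigm ultimately needs a scalar reward, the judge's free-text verdict has to be parsed. A hypothetical parser (the "Score: N" format and the 0-10 range below are assumptions for this sketch, not a convention from the repository):

```python
import re

def parse_judge_score(judge_reply: str, max_score: float = 10.0) -> float:
    """Extract 'Score: N' from a judge model's reply and normalize to [0, 1].
    Returns 0.0 when no score can be found, so training never crashes on
    a malformed judge response."""
    m = re.search(r"[Ss]core\s*[:=]\s*(\d+(?:\.\d+)?)", judge_reply)
    if m is None:
        return 0.0
    # Clamp to the expected range before normalizing.
    return min(float(m.group(1)), max_score) / max_score
```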


Changelog
  • areal/reward/__init__.py
    • Registered reward_model_reward_fn for lazy import.
  • areal/reward/reward_model.py
    • Added a new module to implement a pretrained sequence classification model as a reward function.
  • docs/en/_toc.yml
    • Updated the English table of contents to include the new reward customization guide.
  • docs/en/customization/reward.md
    • Added comprehensive English documentation on custom reward functions, covering rule-based, model-based, and generative approaches.
  • docs/zh/_toc.yml
    • Updated the Chinese table of contents to include the new reward customization guide.
  • docs/zh/customization/reward.md
    • Added comprehensive Chinese documentation on custom reward functions, covering rule-based, model-based, and generative approaches.
  • examples/math/gsm8k_reward_model.py
    • Added an example script demonstrating RL training with a pretrained reward model for GSM8K.
Activity
  • No specific activity (comments, reviews, etc.) was provided in the context for this pull request.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable new feature: a critic-like reward model function, complete with a training example and comprehensive bilingual documentation. A critical security vulnerability was identified in reward_model_reward_fn: an untrusted model path taken from the dataset is used to load a model, which could lead to remote code execution (RCE). It is recommended to take the model path from environment variables or explicit configuration, not from arbitrary dataset fields. Additional feedback focuses on robustness: making model loading thread-safe, making the tokenizer max_length configurable, and improving the documentation's code examples with error handling and dynamic device placement.

Comment on lines +89 to +91
model_path = kwargs.get(
    "reward_model_path", os.environ.get("REWARD_MODEL_PATH", "")
)

Severity: high (security)

The reward_model_reward_fn function retrieves the reward_model_path from the kwargs dictionary, which contains all columns from the dataset. This path is then passed to AutoModelForSequenceClassification.from_pretrained, which can load and execute arbitrary code if the model is in a malicious format (e.g., a PyTorch pickle file). Since the dataset is often sourced from external, untrusted locations, an attacker could include a malicious reward_model_path in the dataset to achieve Remote Code Execution (RCE) on the machine running the training. At a minimum, the environment variable REWARD_MODEL_PATH should take precedence over the dataset field to prevent accidental or malicious overrides. Ideally, the model path should only be configurable via environment variables or explicit configuration, and not from the dataset itself.

    model_path = os.environ.get("REWARD_MODEL_PATH", "")
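
The recommended precedence can be sketched as follows (a hedged illustration; ALLOW_DATASET_REWARD_MODEL_PATH is a hypothetical opt-in flag invented for this sketch, not an existing AReaL setting):

```python
import os

def resolve_reward_model_path(**kwargs) -> str:
    """Prefer the operator-controlled environment variable over any
    dataset-provided field, so untrusted dataset rows cannot redirect
    model loading to a malicious checkpoint."""
    env_path = os.environ.get("REWARD_MODEL_PATH")
    if env_path:
        return env_path
    # Fall back to the dataset field only if the operator explicitly opts in.
    if os.environ.get("ALLOW_DATASET_REWARD_MODEL_PATH") == "1":
        return kwargs.get("reward_model_path", "")
    return ""
```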

        or a HuggingFace Hub model identifier.
    """
    global _model, _tokenizer, _device
    if _model is not None:

Severity: medium

This lazy initialization of the reward model is not thread-safe. In a multi-threaded environment, multiple threads could pass the _model is not None check simultaneously, leading to a race condition where the model is loaded multiple times. To make this more robust, consider using a threading.Lock with a double-checked locking pattern to ensure the check-and-load operation is atomic.

    text,
    return_tensors="pt",
    truncation=True,
    max_length=2048,

Severity: medium

The max_length is hardcoded to 2048. This might not be optimal for all reward models, as some may support longer or require shorter contexts. It would be more flexible to make this value configurable, for instance by allowing it to be passed via **kwargs.

            max_length=kwargs.get("reward_model_max_length", 2048),

Comment on lines +108 to +125
def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("my-org/my-reward-model")
        _model = AutoModelForSequenceClassification.from_pretrained(
            "my-org/my-reward-model", torch_dtype=torch.bfloat16
        ).cuda().eval()

    inputs = _tokenizer(
        prompt + completions,
        return_tensors="pt",
        truncation=True,
        max_length=2048,
    ).to("cuda")

    with torch.no_grad():
        score = _model(**inputs).logits.squeeze().float().item()
    return score

Severity: medium

The code example for a custom model-based reward could be improved to better reflect the best practices mentioned in the 'Key points' section and the implementation in areal/reward/reward_model.py.

The current example lacks:

  1. try...except for error handling.
  2. Dynamic device placement (cuda or cpu) instead of hardcoding .cuda().
  3. A mention of thread-safe lazy loading.

Updating the example to include these would make it a more robust template for users.

def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer, _device

    # Lazy load and cache the model and tokenizer
    if _model is None:
        # For production use, consider adding a lock for thread-safe initialization
        model_path = "my-org/my-reward-model"
        _tokenizer = AutoTokenizer.from_pretrained(model_path)
        _model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        _device = "cuda" if torch.cuda.is_available() else "cpu"
        _model.to(_device).eval()

    try:
        inputs = _tokenizer(
            prompt + completions,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(_device)

        with torch.no_grad():
            score = _model(**inputs).logits.squeeze().float().item()
        return score
    except Exception as e:
        # It's good practice to log the error and return a default reward
        # to prevent the training loop from crashing.
        print(f"Error computing reward: {e}")
        return 0.0

Comment on lines +104 to +121
def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("my-org/my-reward-model")
        _model = AutoModelForSequenceClassification.from_pretrained(
            "my-org/my-reward-model", torch_dtype=torch.bfloat16
        ).cuda().eval()

    inputs = _tokenizer(
        prompt + completions,
        return_tensors="pt",
        truncation=True,
        max_length=2048,
    ).to("cuda")

    with torch.no_grad():
        score = _model(**inputs).logits.squeeze().float().item()
    return score

Severity: medium

The code example for a custom model-based reward could be improved to better reflect the best practices mentioned in the 'Key points' section and the implementation in areal/reward/reward_model.py.

The current example lacks:

  1. try...except for error handling.
  2. Dynamic device placement (cuda or cpu) instead of hardcoding .cuda().
  3. A mention of thread-safe lazy loading.

Updating the example to include these would make it a more robust template for users.

def my_reward_model_fn(prompt, completions, prompt_ids, completion_ids, **kwargs):
    global _model, _tokenizer, _device

    # Lazy load and cache the model and tokenizer
    if _model is None:
        # For production use, consider adding a lock for thread-safe initialization
        model_path = "my-org/my-reward-model"
        _tokenizer = AutoTokenizer.from_pretrained(model_path)
        _model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        _device = "cuda" if torch.cuda.is_available() else "cpu"
        _model.to(_device).eval()

    try:
        inputs = _tokenizer(
            prompt + completions,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(_device)

        with torch.no_grad():
            score = _model(**inputs).logits.squeeze().float().item()
        return score
    except Exception as e:
        # It's good practice to log the error and return a default reward
        # to prevent the training loop from crashing.
        print(f"Error computing reward: {e}")
        return 0.0

…erative examples

Add bilingual (EN/ZH) reward customization guide covering:
- Rule-based rewards (exact match, math verification)
- Model-based rewards (critic-like with AutoModelForSequenceClassification)
- Generative rewards (LLM-as-judge, referencing tongyi_deepresearch)
- AsyncRewardWrapper usage
- Training integration patterns

All examples reference existing code in the repository.
@Zijun9 Zijun9 force-pushed the feat/reward-model-example branch from db30879 to b987ed1 on March 7, 2026 23:35
@Zijun9 Zijun9 changed the title from "feat(reward): add reward model example and customization guide" to "docs(reward): add reward customization guide with critic-like and generative examples" on Mar 7, 2026