feat: hot-reload mechanism for geaflow-infer module. by aotenjou · Pull Request #786 · apache/geaflow

aotenjou · 2026-04-17T02:55:40Z

What changes were proposed in this pull request?

How was this PR tested?

This pull request introduces a hot-reload mechanism for model inference in GeaFlow. It adds new configuration keys, extends the Java inference context and data exchange logic, and implements the hot-reload workflow in the Python inference session. Models can then be updated dynamically, with a configurable polling interval, failure backoff, and optional warmup, which makes model deployment more flexible and reliable.

Hot-reload configuration and integration:

Added new configuration keys in FrameworkConfigKeys to control model hot-reload behavior, including model path, version file, polling interval, backoff, warmup, and enable flags.
Updated InferContext to read and pass these hot-reload settings as parameters to the inference process, integrating them into the command invocation.
Extended InferEnvironmentContext to generate the appropriate command-line arguments for the new hot-reload parameters.
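On the receiving side, the Python process has to turn these command-line arguments back into typed settings. The following is a minimal sketch of how that could look with `argparse`; every flag name and default here is hypothetical and is not taken from the PR. It also illustrates why strict boolean arguments matter: `bool("False")` is `True` in Python, so a dedicated converter is needed.

```python
import argparse


def strict_bool(text: str) -> bool:
    """Accept only the literal strings 'True' and 'False'."""
    if text == "True":
        return True
    if text == "False":
        return False
    raise argparse.ArgumentTypeError(f"expected 'True' or 'False', got {text!r}")


def build_parser() -> argparse.ArgumentParser:
    # All flag names and defaults below are illustrative assumptions.
    parser = argparse.ArgumentParser(description="hot-reload options (sketch)")
    parser.add_argument("--hot_reload_enable", type=strict_bool, default=False)
    parser.add_argument("--model_path", type=str, default="")
    parser.add_argument("--model_version_file", type=str, default="")
    parser.add_argument("--poll_interval_sec", type=float, default=5.0)
    parser.add_argument("--backoff_sec", type=float, default=30.0)
    parser.add_argument("--warmup_enable", type=strict_bool, default=True)
    return parser


# Usage: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["--hot_reload_enable", "True", "--poll_interval_sec", "2.5"]
)
print(args.hot_reload_enable, args.poll_interval_sec)
```

Rejecting anything other than the two exact literals keeps a misspelled value such as `false` from silently enabling a feature.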

Resource management improvements:

Refactored DataExchangeContext and DataExchangeQueue to use instance-level AtomicBoolean flags for safe resource cleanup, and improved shutdown hook handling to ensure native memory is released exactly once.
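The PR makes this change in Java with `AtomicBoolean`. To illustrate the exactly-once idea, here is an analogous sketch in Python, not the PR's code: a per-instance flag is checked and set under a lock, so that an explicit `close()` and a shutdown hook can both run without releasing the resource twice. The class name and the `release_count` counter are hypothetical stand-ins.

```python
import threading


class NativeBuffer:
    """Stand-in for a resource whose native memory must be freed exactly once."""

    def __init__(self) -> None:
        self._closed = False
        self._lock = threading.Lock()
        self.release_count = 0  # counts how often the release actually ran

    def close(self) -> bool:
        """Release the resource; return True only for the call that did the work."""
        with self._lock:
            if self._closed:
                return False  # a previous caller already released it
            self._closed = True
        # Only one caller ever reaches this point.
        self.release_count += 1  # stand-in for freeing native memory
        return True
```

Because the flag lives on the instance rather than in a static field, closing one buffer does not affect any other.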

Python inference session hot-reload logic:

Implemented a new hot-reload workflow in inferSession.py using a background thread to monitor the model version manifest, reload models as needed, handle failures with backoff, and optionally perform warmup. This design supports atomic model swaps and robust error handling.
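A minimal sketch of such a polling loop follows, under stated assumptions: the version manifest is a text file that holds a single version string, and `loader` is a caller-supplied function that returns a model object for a version. The class and method names are hypothetical and do not come from inferSession.py; backoff and warmup are left out for brevity.

```python
import threading
from pathlib import Path
from typing import Any, Callable, Optional, Tuple


class ModelWatcher:
    """Poll a version file and swap in a newly loaded model atomically."""

    def __init__(self, version_file: str, loader: Callable[[str], Any],
                 poll_interval_sec: float = 5.0) -> None:
        self._version_file = Path(version_file)
        self._loader = loader
        self._poll_interval = poll_interval_sec
        self._lock = threading.Lock()
        self._model: Any = None
        self._version: Optional[str] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _read_version(self) -> Optional[str]:
        try:
            text = self._version_file.read_text().strip()
        except OSError:
            return None  # missing or unreadable manifest: nothing to do
        return text or None

    def poll_once(self) -> bool:
        """Check the manifest once; return True if a new model was swapped in."""
        version = self._read_version()
        if version is None or version == self._version:
            return False
        try:
            model = self._loader(version)
        except Exception:
            return False  # keep serving the current model on load failure
        with self._lock:
            # Model and version change together, so readers never see a mix.
            self._model, self._version = model, version
        return True

    def current(self) -> Tuple[Any, Optional[str]]:
        with self._lock:
            return self._model, self._version

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._poll_interval):
            self.poll_once()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

The new model is loaded outside the lock, so the lock is held only for the pointer swap and inference calls that use `current()` are not blocked while a model loads.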

Together, these changes enable dynamic, reliable, and configurable hot-reloading of inference models in GeaFlow.

Tests have been added for the changes
Production environment verified

…time

Implement blue-green hot swap in TorchInferSession with throttled version polling, async single-flight loading, warmup-before-switch, and rollback with backoff so workers can safely adopt newly published models without request interruption.
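The swap steps named in this commit message can be sketched as follows. This is an illustration under assumptions, not the code in TorchInferSession: `load` and `warmup` are caller-supplied callables, the class name and the returned status strings are hypothetical, and `try_upgrade` is synchronous here, so a caller would run it on a background thread to keep it off the request path.

```python
import threading
import time
from typing import Any, Callable, Optional, Tuple


class BlueGreenSwapper:
    """Load a candidate model beside the active one and switch only on success."""

    def __init__(self, load: Callable[[str], Any],
                 warmup: Optional[Callable[[Any], None]] = None,
                 backoff_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._warmup = warmup
        self._backoff_sec = backoff_sec
        self._clock = clock
        self._swap_lock = threading.Lock()
        self._inflight = threading.Lock()  # single-flight guard
        self._active: Any = None
        self._active_version: Optional[str] = None
        self._retry_after = 0.0

    def active(self) -> Tuple[Any, Optional[str]]:
        with self._swap_lock:
            return self._active, self._active_version

    def try_upgrade(self, version: str) -> str:
        if version == self.active()[1]:
            return "current"
        if self._clock() < self._retry_after:
            return "backoff"  # a recent attempt failed; wait before retrying
        if not self._inflight.acquire(blocking=False):
            return "busy"  # another load is already running
        try:
            try:
                candidate = self._load(version)
                if self._warmup is not None:
                    self._warmup(candidate)  # warm up before switching
            except Exception:
                # Rollback: the active model was never replaced.
                self._retry_after = self._clock() + self._backoff_sec
                return "failed"
            with self._swap_lock:
                self._active, self._active_version = candidate, version
            return "swapped"
        finally:
            self._inflight.release()
```

Since the candidate is fully loaded and warmed up before the swap, a failed load or warmup never disturbs the model that is currently serving requests. The injectable `clock` parameter exists only so that the backoff window can be exercised without sleeping.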

aotenjou added 5 commits April 15, 2026 12:06

chore: modified param and test.

3f452d3

fix(infer): make data exchange shutdown idempotent

3bb8e18

fix(infer): emit strict python boolean args

9b59ca0

feat(infer): honor configured hot reload timing

f6daca3

aotenjou marked this pull request as ready for review April 17, 2026 09:36

feat(infer): double-buffer hot reload

b928adf

aotenjou wants to merge 6 commits into apache:master from aotenjou:master
