Open
Description
Describe the bug
The current GraphGen engine performs redundant computation when a pipeline has multiple outputs. Because Ray Data executes lazily by default, an intermediate node that feeds more than one downstream node is fully recomputed for each consumer instead of being cached and reused. In a typical configuration (e.g., generate → evaluate), the shared work runs at least twice, and the recomputation propagates up the entire upstream chain (chunk → build_kg → partition).
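The recomputation described above can be illustrated with a toy model of lazy execution. This is a sketch in plain Python, not GraphGen or Ray Data code: `LazyNode`, `run_branches`, and the `counts` dictionary are hypothetical names introduced here only to count how often each node runs. In Ray Data itself, the call that pins a dataset's blocks so later consumers reuse them is, as far as I know, `Dataset.materialize()`.

```python
from typing import Callable, Dict, List, Optional


class LazyNode:
    """Toy stand-in for a lazily executed dataset (hypothetical, not Ray's API)."""

    def __init__(self, name: str, fn: Callable[[List[int]], List[int]],
                 parent: Optional["LazyNode"] = None) -> None:
        self.name = name
        self.fn = fn
        self.parent = parent
        self._cached: Optional[List[int]] = None

    def compute(self, counts: Dict[str, int]) -> List[int]:
        # A materialized node returns its stored rows without re-running.
        if self._cached is not None:
            return self._cached
        # Otherwise the whole upstream chain is re-executed on every call.
        rows = self.parent.compute(counts) if self.parent else []
        counts[self.name] = counts.get(self.name, 0) + 1
        return self.fn(rows)

    def materialize(self, counts: Dict[str, int]) -> "LazyNode":
        # Run once and keep the result for all later consumers.
        self._cached = self.compute(counts)
        return self


def run_branches(materialize_shared: bool) -> Dict[str, int]:
    """Execute step2 and step3, which both depend on step1, and count runs."""
    counts: Dict[str, int] = {}
    step1 = LazyNode("step1", lambda _rows: [1, 2, 3])
    if materialize_shared:
        step1.materialize(counts)
    step2 = LazyNode("step2", lambda rows: [r * 2 for r in rows], parent=step1)
    step3 = LazyNode("step3", lambda rows: [r + 1 for r in rows], parent=step1)
    step2.compute(counts)
    step3.compute(counts)
    return counts


print(run_branches(materialize_shared=False))  # step1 is counted twice
print(run_branches(materialize_shared=True))   # step1 is counted once
```

Without materialization, each branch pulls step1 again; with it, step1 runs once and both branches read the stored rows.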
To Reproduce
Steps to reproduce the behavior:
Use this minimal configuration to reproduce:
nodes:
  - id: step1
    op_name: some_expensive_op
    type: map_batch
    save_output: true
  - id: step2
    op_name: downstream_op
    dependencies: [step1]
    type: map_batch
  - id: step3
    op_name: another_downstream
    dependencies: [step1]  # Shares step1's output
    type: map_batch
    save_output: true
Run the pipeline and check the logs: step1 is executed twice instead of once.
Expected behavior
Choice 1:
- The engine should automatically identify branch nodes and intelligently materialize them to prevent redundant computation, without requiring user configuration changes.
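One possible way to identify branch nodes for Choice 1 is to count how many nodes list each node in their `dependencies`. The sketch below assumes the configuration has already been parsed into a list of dictionaries shaped like the reproduction config; `find_branch_nodes` is a hypothetical helper, not an existing GraphGen function.

```python
from collections import Counter
from typing import Dict, List


def find_branch_nodes(nodes: List[Dict]) -> List[str]:
    """Return ids of nodes consumed by more than one downstream node."""
    # Count how many times each node id appears as a dependency.
    fan_out = Counter(
        dep for node in nodes for dep in node.get("dependencies", [])
    )
    # Preserve config order so the result is deterministic.
    return [node["id"] for node in nodes if fan_out[node["id"]] > 1]


# Parsed form of the reproduction config (assumed structure).
config_nodes = [
    {"id": "step1", "op_name": "some_expensive_op", "save_output": True},
    {"id": "step2", "op_name": "downstream_op", "dependencies": ["step1"]},
    {"id": "step3", "op_name": "another_downstream",
     "dependencies": ["step1"], "save_output": True},
]

print(find_branch_nodes(config_nodes))
```

The engine could materialize exactly these nodes before building their consumers. Whether `save_output: true` should count as an additional consumer is not stated in this issue, so the sketch ignores it.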
Choice 2:
- The engine should store the result of each task to disk and skip the computation when a stored result already exists.
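Choice 2 could look roughly like the following skip-if-present cache. This is a minimal sketch under several assumptions: results are JSON-serializable lists, the cache key is derived from the node id, its parameters, and the keys of its upstream nodes, and the names `cache_key` and `run_cached` are hypothetical. A real engine would more likely store Ray Data blocks (for example as Parquet) than JSON.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def cache_key(node_id: str, params: Dict, upstream_keys: List[str]) -> str:
    """Derive a stable key so that any upstream change invalidates the entry."""
    payload = json.dumps(
        {"id": node_id, "params": params, "upstream": upstream_keys},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_cached(cache_dir: Path, key: str, compute: Callable[[], list]) -> list:
    """Return the stored result for `key`, computing and storing it if absent."""
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Already handled: skip the computation entirely.
        return json.loads(path.read_text())
    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so a crash never leaves a partial entry.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)
    return result


calls: List[int] = []


def expensive_op() -> list:
    calls.append(1)  # Record each real execution.
    return [1, 2, 3]


with tempfile.TemporaryDirectory() as tmp_dir:
    key = cache_key("step1", {"op_name": "some_expensive_op"}, [])
    first = run_cached(Path(tmp_dir), key, expensive_op)
    second = run_cached(Path(tmp_dir), key, expensive_op)

print(len(calls))  # number of times expensive_op actually ran
```

Including the upstream keys in the hash means a change to any ancestor's parameters produces a new key, so stale results are not reused.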