
[BUG] Ray Data Redundant Execution #141

@ChenZiHong-Gavin

Describe the bug
The current GraphGen engine performs redundant computation when a pipeline has multiple outputs. Because Ray Data executes lazily by default, an intermediate node is fully recomputed for every downstream consumer instead of being cached and reused. In typical configurations (e.g., generate → evaluate), this at least doubles the computation, and the amplification propagates across the entire upstream chain (chunk → build_kg → partition).
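The recomputation pattern can be illustrated with a toy model. This is a minimal sketch and not Ray Data or GraphGen code: `LazyNode`, its `runs` counter, and `build` are hypothetical names introduced only for illustration. It shows a shared upstream node running once per consumer when evaluation is lazy, and once in total when its result is pinned first.

```python
from typing import Callable, List, Optional


class LazyNode:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Ray Data)."""

    def __init__(
        self,
        name: str,
        fn: Callable[[List[int]], List[int]],
        parent: Optional["LazyNode"] = None,
    ) -> None:
        self.name = name
        self.fn = fn
        self.parent = parent
        self.runs = 0  # number of times this node's fn has executed
        self._cached: Optional[List[int]] = None

    def materialize(self) -> "LazyNode":
        """Compute once and pin the result so later consumers reuse it."""
        self._cached = self.compute()
        return self

    def compute(self) -> List[int]:
        # A pinned result short-circuits the whole upstream chain.
        if self._cached is not None:
            return self._cached
        upstream = self.parent.compute() if self.parent else []
        self.runs += 1
        return self.fn(upstream)


def build(materialize_branch: bool) -> int:
    """Run step2 and step3 on top of step1; return how often step1 executed."""
    step1 = LazyNode("step1", lambda _: [1, 2, 3])
    if materialize_branch:
        step1.materialize()
    step2 = LazyNode("step2", lambda xs: [x * 2 for x in xs], parent=step1)
    step3 = LazyNode("step3", lambda xs: [x + 1 for x in xs], parent=step1)
    step2.compute()
    step3.compute()
    return step1.runs


lazy_runs = build(materialize_branch=False)
cached_runs = build(materialize_branch=True)
print(lazy_runs, cached_runs)  # prints "2 1"
```

In this model the cost of the unpinned case grows with the number of consumers and with the depth of the chain above the shared node, which matches the amplification described above.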

To Reproduce
Run a pipeline with this minimal configuration:

nodes:
  - id: step1
    op_name: some_expensive_op
    type: map_batch
    save_output: true
    
  - id: step2
    op_name: downstream_op
    dependencies: [step1]
    type: map_batch
    
  - id: step3
    op_name: another_downstream
    dependencies: [step1]  # Shares step1's output
    type: map_batch
    save_output: true

Check the logs: step1 is executed twice instead of once.

Expected behavior
Choice 1:

  • The engine should automatically identify branch nodes and intelligently materialize them to prevent redundant computation, without requiring user configuration changes.
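One possible way to identify branch nodes is to count how many nodes declare each node as a dependency. The sketch below assumes the node configuration has already been parsed into a list of dicts with the `id` and `dependencies` keys from the reproduction config; `find_branch_nodes` is a hypothetical helper, not an existing GraphGen function. It counts declared dependencies only. Nodes it returns would be candidates for materialization, for example with Ray Data's `Dataset.materialize()`, though whether the engine should use that call is an assumption here.

```python
from collections import Counter
from typing import Any, Dict, List, Set


def find_branch_nodes(nodes: List[Dict[str, Any]]) -> Set[str]:
    """Return ids of nodes that more than one other node depends on."""
    consumers: Counter = Counter()
    for node in nodes:
        for dep in node.get("dependencies", []):
            consumers[dep] += 1
    return {node_id for node_id, count in consumers.items() if count > 1}


# The reproduction config from this issue, written as Python dicts.
nodes = [
    {"id": "step1", "op_name": "some_expensive_op", "type": "map_batch",
     "save_output": True},
    {"id": "step2", "op_name": "downstream_op", "dependencies": ["step1"],
     "type": "map_batch"},
    {"id": "step3", "op_name": "another_downstream", "dependencies": ["step1"],
     "type": "map_batch", "save_output": True},
]

branch_nodes = find_branch_nodes(nodes)
print(branch_nodes)  # prints "{'step1'}"
```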

Choice 2:

  • The engine should store the result of each task to disk and skip the computation when a stored result already exists.
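A rough sketch of the second choice, using only the Python standard library. The names `cache_key` and `run_cached`, the pickle file layout, and the keying scheme (node config plus upstream keys) are all assumptions made for illustration, not GraphGen's design. Hashing the upstream keys into each key means a change in any upstream node invalidates everything downstream of it.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def cache_key(node: Dict[str, Any], upstream_keys: List[str]) -> str:
    """Hash the node config together with the keys of its upstream results."""
    payload = json.dumps({"node": node, "upstream": upstream_keys}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(
    node: Dict[str, Any],
    upstream_keys: List[str],
    compute: Callable[[], Any],
    cache_dir: Path,
) -> Tuple[Any, str]:
    """Return (result, key), loading from disk when a stored result exists."""
    key = cache_key(node, upstream_keys)
    path = cache_dir / f"{node['id']}-{key[:16]}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f), key
    result = compute()
    # Write to a temporary file, then rename, so a crash mid-write
    # never leaves a truncated file under the final name.
    tmp = path.with_suffix(".tmp")
    with tmp.open("wb") as f:
        pickle.dump(result, f)
    tmp.replace(path)
    return result, key


calls = {"n": 0}


def expensive() -> List[int]:
    calls["n"] += 1
    return [1, 2, 3]


node = {"id": "step1", "op_name": "some_expensive_op", "type": "map_batch"}
with tempfile.TemporaryDirectory() as d:
    first, key1 = run_cached(node, [], expensive, Path(d))
    second, key2 = run_cached(node, [], expensive, Path(d))

print(calls["n"])  # prints "1": the second call was served from disk
```

Compared with in-memory materialization, this approach also survives process restarts, at the cost of serialization overhead and the need to decide when stale results are invalidated.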
