[BUG] Ray Data Redundant Execution #141

**Describe the bug**
The current GraphGen engine performs redundant computation when a pipeline has multiple outputs. Because Ray Data executes lazily by default, an intermediate node is fully recomputed each time a downstream node consumes it, rather than being cached and reused. In typical configurations (e.g., generate → evaluate), this at least doubles the computation, and the amplification propagates through the entire upstream chain (chunk → build_kg → partition).
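The recomputation pattern can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Ray Data API or GraphGen code: `LazyDataset`, `expensive`, and `run` are hypothetical names, and the two downstream maps stand in for two consumers of `step1`. In Ray Data itself, the comparable way to pin an intermediate result is `Dataset.materialize()`.

```python
from typing import Callable, List


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Ray Data)."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Only extends the plan; nothing is executed here.
        return LazyDataset(lambda: [fn(x) for x in self._compute()])

    def materialize(self) -> "LazyDataset":
        # Executes the plan once and pins the rows for later consumers.
        rows = self._compute()
        return LazyDataset(lambda: rows)

    def take_all(self) -> List[int]:
        # Every call re-runs the whole upstream plan.
        return self._compute()


calls = {"step1": 0}


def expensive(x: int) -> int:
    calls["step1"] += 1  # count how often step1's op runs
    return x * 2


def run(materialize: bool) -> int:
    """Run step1 with two downstream consumers; return step1's call count."""
    calls["step1"] = 0
    step1 = LazyDataset(lambda: [1, 2, 3]).map(expensive)
    if materialize:
        step1 = step1.materialize()
    step1.map(lambda x: x + 1).take_all()  # first consumer of step1
    step1.map(lambda x: x - 1).take_all()  # second consumer of step1
    return calls["step1"]
```

In this toy model, `run(False)` invokes `expensive` once per row per consumer, while `run(True)` invokes it once per row in total.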

**To Reproduce**
Use this minimal configuration:
```yaml
nodes:
  - id: step1
    op_name: some_expensive_op
    type: map_batch
    save_output: true

  - id: step2
    op_name: downstream_op
    dependencies: [step1]
    type: map_batch

  - id: step3
    op_name: another_downstream
    dependencies: [step1]  # Shares step1's output
    type: map_batch
    save_output: true
```
Run the pipeline and check the logs: step1 is executed twice instead of once.

**Expected behavior**
Choice 1:
- The engine should automatically identify branch nodes and intelligently materialize them to prevent redundant computation, without requiring user configuration changes.
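One way Choice 1 could identify branch nodes is to count how many consumers each node has in the parsed config. The sketch below is an assumption about how this might work, not GraphGen's implementation: `find_branch_nodes` is a hypothetical helper, and treating `save_output: true` as one extra consumer of that node is an illustrative guess about how outputs are written.

```python
from collections import Counter
from typing import Dict, List


def find_branch_nodes(nodes: List[Dict]) -> List[str]:
    """Return ids of nodes consumed more than once (candidates to materialize)."""
    consumers: Counter = Counter()
    for node in nodes:
        # Each declared dependency is one downstream read of that upstream node.
        for dep in node.get("dependencies", []):
            consumers[dep] += 1
        # Assumption: writing a node's output to disk is one more read of it.
        if node.get("save_output"):
            consumers[node["id"]] += 1
    return [n["id"] for n in nodes if consumers[n["id"]] > 1]


# The configuration from this issue, as parsed dictionaries.
config_nodes = [
    {"id": "step1", "op_name": "some_expensive_op", "save_output": True},
    {"id": "step2", "op_name": "downstream_op", "dependencies": ["step1"]},
    {"id": "step3", "op_name": "another_downstream",
     "dependencies": ["step1"], "save_output": True},
]

branch_nodes = find_branch_nodes(config_nodes)
```

For the configuration above, only `step1` has more than one consumer, so it is the only node the engine would need to materialize.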

Choice 2:
- The engine should persist the result of each task to disk and skip the computation when a stored result already exists.
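Choice 2 could be sketched as a small on-disk cache keyed by the node's config and its upstream keys. This is a hedged illustration only: `cache_key`, `run_cached`, the pickle file layout, and the key scheme are all hypothetical and not part of GraphGen or Ray Data.

```python
import hashlib
import json
import pickle
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def cache_key(node: Dict, upstream_keys: List[str]) -> str:
    """Hash the node config together with its upstream keys."""
    payload = json.dumps({"node": node, "upstream": upstream_keys}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def run_cached(
    node: Dict,
    upstream_keys: List[str],
    compute: Callable[[], Any],
    cache_dir: str,
) -> Tuple[str, Any]:
    """Return (key, result), loading from disk when the task already ran."""
    key = cache_key(node, upstream_keys)
    path = Path(cache_dir) / f"{node['id']}-{key}.pkl"
    if path.exists():
        # A stored result exists: skip the computation entirely.
        with path.open("rb") as f:
            return key, pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return key, result
```

Including the upstream keys in the hash means that changing any upstream node's config invalidates every cached result below it, so a stale result is never reused after an upstream change.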



