Skip to content

Job lineage shows only one upstream edge for multi-run job with different IO sets #3093

@ElwesHonorato

Description

@ElwesHonorato

Context

I am testing OpenLineage ingestion with Marquez and observed a mismatch between:

  • run history for a job, and
  • job-level lineage edges shown for that same job.

Environment

  • Marquez container image: marquezproject/marquez:latest (running app jar: marquez-api-0.51.1.jar)
  • Marquez API endpoint: POST /api/v1/lineage
  • Namespace/job under test: governed-rag / worker_embed_chunks

Repro Steps

  1. Start from a clean Marquez DB.
  2. Post 03_04_chunk_text events.
  3. Confirm 03_04_chunk_text outputs exactly 2 datasets:
    • 04_chunks/...part1/...chunk.json
    • 04_chunks/...part2/...chunk.json
  4. Post 04_05_embed_chunks events with 2 independent runs:
    • Run A consumes 04_chunks/...part1/...chunk.json and produces 05_embeddings/...part1/...embedding.json
    • Run B consumes 04_chunks/...part2/...chunk.json and produces 05_embeddings/...part2/...embedding.json
  5. Query both:
    • run history
    • job-level lineage graph

Exact Evidence (from API)

A) Linear producer/consumer expectation

  • Producer stage (03_04_chunk_text) outputs 2 chunk files (part1, part2).
  • Consumer stage (04_05_embed_chunks) processes each chunk independently in separate runs.
  • Therefore, job-level lineage for worker_embed_chunks should show 2 upstream chunk edges.

B) Run history confirms 2 independent consumer runs

Command:

docker exec rag-lineage-marquez-1 sh -lc \
  "curl -sS 'http://localhost:5000/api/v1/namespaces/governed-rag/jobs/worker_embed_chunks/runs?limit=20'"

Extracted results:

  • RUN_COUNT: 2
  • RUN_INPUT_DATASETS:
    • 04_chunks/a204e8b00fc95a95e854f4a0.part1/5d6fa2bd4c8f9fda94a1b948ae0c5a5db139bc6884acac22705d828b2d1a09bd.chunk.json
    • 04_chunks/a204e8b00fc95a95e854f4a0.part2/40672b6cd77dc46285d180c7e0fe9f1c9e0725cfcbd6e12bfe415810b67d0bf8.chunk.json

C) Job lineage in-edge to embed shows only one upstream chunk

Command:

docker exec rag-lineage-marquez-1 sh -lc \
  "curl -sS 'http://localhost:5000/api/v1/lineage?nodeId=job%3Agoverned-rag%3Aworker_embed_chunks&depth=3'"

Extracted in-edge(s) to job:

  • "origin":"dataset:rag-data:04_chunks/...part2/...chunk.json","destination":"job:governed-rag:worker_embed_chunks"

So: upstream produces 2 chunk datasets and consumer runs ingest both, but job-level lineage edge list reflects only one upstream chunk.

Expected

At job-level lineage for worker_embed_chunks, I expect upstream connectivity to represent all relevant ingested runs (or a clearly documented mode if only latest run should be shown).

Actual

Job-level lineage appears partial, reflecting only one upstream edge, while API run history confirms multiple valid upstream datasets across runs.

Additional UI Observation (chronological)

  • Right after processing part1 only, UI lineage looked correct (connected chain for part1).
  • After processing part2, the part2 lineage looked correct, but part1 started appearing as a disconnected node.
  • This suggests previously visible upstream connectivity can be replaced/overwritten when a new run for the same job is ingested.

Why this matters

In UI/graph investigation, this looks like disconnected/isolated lineage branches even though events were accepted (201) and run history is complete.

Request

Please confirm whether this is:

  1. intended "latest/current run only" job-lineage behavior, or
  2. a bug/regression in lineage edge materialization for multi-run jobs.

If intended, is there a supported API/UI mode to view cumulative lineage across runs?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions