Skip to content

[Improve][Transform-V2][Embedding]Enhance multimodal embeddings#9996

Merged
davidzollo merged 19 commits into
apache:devfrom
loupipalien:enhance-multimodal-embeddings
Jun 29, 2026
Merged

[Improve][Transform-V2][Embedding]Enhance multimodal embeddings#9996
davidzollo merged 19 commits into
apache:devfrom
loupipalien:enhance-multimodal-embeddings

Conversation

@loupipalien

@loupipalien loupipalien commented Oct 29, 2025

Copy link
Copy Markdown
Contributor

Purpose of this pull request

Multi-field multimodal vectorization, doubao-embedding-vision supports multi-field multimodal mixing as input

vectorization_fields {
      multi_field_text_vector = [product_name, description]

      multi_field_image_vector = [
        {
          field = product_image_url
          modality = jpeg
          format = url
        },
        {
          field = thumbnail_image
          modality = png
          format = url
        }
      ]

      multi_field_video_vector = [
        {
          field = product_video_url
          modality = mp4
          format = url
        },
        {
          field = promotional_video
          modality = mov
          format = url
        }
      ]

      multi_field_mix_vector = [
        product_name,
        {
          field = product_image_url
          modality = jpeg
          format = url
        },
        {
          field = product_video_url
          modality = mp4
          format = url
        }
      ]
 }

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add new test cases

Check list

@loupipalien loupipalien changed the title Enhance multimodal embeddings [Improve][Transform-V2][Embedding]Enhance multimodal embeddings Oct 30, 2025
@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 2f8b47b to bdbd012 Compare November 4, 2025 14:46
@loupipalien

Copy link
Copy Markdown
Contributor Author

@Hisoka-X @corgy-w @xiaochen-zhou help to review if have time, thanks

davidzollo
davidzollo previously approved these changes Jan 1, 2026

@davidzollo davidzollo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy new year!
+1

Comment on lines +208 to +226
for (Map.Entry<String, Object> fieldConfig : fieldsConfig.entrySet()) {
VectorFieldSpec vectorFieldSpec = new VectorFieldSpec(fieldConfig);
log.info("Vector field spec: {}", vectorFieldSpec);
List<String> srcFieldNames =
vectorFieldSpec.getSrcFieldSpecs().stream()
.map(SrcFieldSpec::getFieldName)
.collect(Collectors.toList());
List<Integer> srcFieldIndexes = new ArrayList<>();
for (String srcFieldName : srcFieldNames) {
try {
srcFieldIndexes.add(inputRowType.indexOf(srcFieldName));
} catch (IllegalArgumentException e) {
throw TransformCommonError.cannotFindInputFieldsError(
getPluginName(), srcFieldNames);
}
}
fieldSpecMap.put(srcFieldIndex, fieldSpec);
fieldNames.add(field.getKey());
isMultimodalFields = vectorFieldSpec.isMultimodalField();
fieldSpecMap.put(vectorFieldSpec, srcFieldIndexes);
fieldNames.add(vectorFieldSpec.getFieldName());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isMultimodalFields = vectorFieldSpec.isMultimodalField();

There is a logical issue; currently, only the last value can be obtained.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your time, let me fix it

@davidzollo

Copy link
Copy Markdown
Contributor

Thank you for this excellent proposal! The refactoring to use VectorFieldSpec and SrcFieldSpec is a very elegant and extensible design for multimodal processing.

During the review, I found a few data-corruption bugs and missing requirements that must be addressed before this can be merged.

1. Output Field Order Mismatch (EmbeddingTransform.java)

Problem:
In initOutputFields(), fieldSpecMap is initialized as a HashMap. Since HashMap does not preserve insertion order, iterating over fieldSpecMap.keySet() in getOutputFieldValues() will generate the fieldValues array in a random order. This order will NOT match the order of fieldNames declared in getOutputColumns().
As a result, vectors will be written to the wrong columns downstream (e.g., text vector data inserted into the image vector column), causing severe data corruption in production.

Suggestion:
Please change it to LinkedHashMap to maintain insertion order:

// EmbeddingTransform.java
Map<VectorFieldSpec, List<Integer>> fieldSpecMap = new LinkedHashMap<>();

2. CRITICAL: Corrupted Binary payload in toBase64 (SrcField.java)

Problem:
In the toBase64() method, simply calling fieldValue.toString().getBytes() is extremely dangerous for binary data. If fieldValue is a raw byte[] (which happens when reading actual binary image/video streams), Java's .toString() on a byte array returns its memory address (e.g., [B@1a2b3c), NOT the content. The model will receive garbage string hashes instead of the actual image data.

Suggestion:
Add proper type checking for byte arrays to fix the format conversion:

// SrcField.java
public String toBase64() {
    if (fieldSpec == null || !fieldSpec.isBinary()) {
        throw new IllegalArgumentException("Payload format must be binary");
    }
    if (fieldValue == null) {
        throw new IllegalArgumentException("Binary data cannot be null or empty");
    }
    
    if (fieldValue instanceof byte[]) {
        return Base64.getEncoder().encodeToString((byte[]) fieldValue);
    } else {
        return Base64.getEncoder().encodeToString(String.valueOf(fieldValue).getBytes());
    }
}

3. Missing Assertions in Tests

In DoubaoMultimodalModelTest.java, you verified Assertions.assertTrue(inputNode.has("image_url")). However, considering the Base64 bug mentioned above, please update the test cases (especially testMultimodalBodyWithBinaryImage) to strictly assert the actual Base64 string value to ensure the image sequence is encoded correctly.

4. Missing Documentation (Blocker)

According to SeaTunnel's contribution guidelines, any user-visible configuration changes must be documented. The new config capabilities (like multi_field_image_vector array configurations) are highly valuable, but they are lacking documentation.
Please add usage examples and explanations in both:

  • docs/en/.../embedding-v2.md or related docs
  • docs/zh/.../embedding-v2.md

Looking forward to your updates! Let me know if you have any questions.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 5f5745e to a54db18 Compare March 10, 2026 14:05
@github-actions github-actions Bot added the CI&CD label Mar 12, 2026
@loupipalien

Copy link
Copy Markdown
Contributor Author

@davidzollo Thanks for your patient guidance and valuable suggestions ❤️. Let me fix these problems.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from b67b4d4 to 568fab0 Compare March 14, 2026 15:56
davidzollo
davidzollo previously approved these changes Mar 15, 2026

@davidzollo davidzollo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 if CI passes.
Good job,thanks for the efforts.
By the way, Testing with ThreadLocalRandom causes non-determinism; it is recommended to change to parameterized testing.

You can feel free to add my LinkedIn https://www.linkedin.com/in/davidzollo or WeChat(davidzollo) to build a connection ^_^

@DanielLeens DanielLeens left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed the latest head against the full current PR diff and re-traced the actual multimodal runtime path from EmbeddingTransform into SrcField, MultimodalFieldValue, and DoubaoModel.

What this PR solves

  • User pain: the Embedding transform could not cleanly express or assemble mixed text / image / video inputs into one embedding request, and previous revisions could also let runtime suffix detection override an explicitly configured modality.
  • Fix approach: the PR splits the config/runtime responsibilities into VectorFieldSpec, SrcFieldSpec, and per-row SrcField, then builds the Doubao multimodal request body from those row-scoped source fields.
  • One-line summary: the latest head turns multimodal field assembly into a much healthier per-row flow, and I do not see a new source-level blocker on the current diff.

Runtime path rechecked

SeaTunnelRow enters EmbeddingTransform
  -> initOutputFields() parses vectorization_fields into VectorFieldSpec + source indexes
  -> getOutputFieldValues()
      -> build per-row SrcField objects
      -> SrcField.determineModalityType()
          -> keep explicit modality as source of truth
          -> only auto-detect from suffix when modality is unset and format=url
      -> wrap multimodal outputs as MultimodalFieldValue

Doubao request assembly
  -> DoubaoModel.multimodalVector()
      -> multimodalVectorGeneration()
          -> multimodalBody()
              -> inputRawData()
                  -> text -> {"type":"text","text":...}
                  -> image/video url -> {"type":"image_url"/"video_url", ...}
                  -> binary -> base64 data URL

Key findings

  1. The earlier explicit-modality blocker is fixed on the current head. SrcField now clones the config spec per row and returns early in determineModalityType() when the modality was explicitly configured or the payload is not URL-based.
  2. The normal path does hit this PR: EmbeddingTransform.getOutputFieldValues() now always builds row-scoped SrcField objects before the request is forwarded to DoubaoModel.
  3. The new regression coverage for explicit-modality preservation and text values ending with image-like suffixes is meaningful and matches the source fix.
  4. The current PR-level checks are green from GitHub's side (Build, Notify test workflow, labeler).

Findings

No new blocking source-level issue from me on the latest head.

There is one non-blocking test-quality point:

  1. ThreadLocalRandom is still used in the new test code
    • Location:
      • seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/embedding/DoubaoMultimodalModelTest.java:247
      • seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/embedding/MultimodalConfigTest.java:209
    • +1 to @davidzollo's earlier point about nondeterministic test construction. The current assertions are likely stable because both branches converge to the same expected result, so I would not block merge on this.
    • The downside is that each run only exercises one config shape at random, which makes branch coverage and later failure reproduction harder than necessary.
    • Better fix:
      • Option A: turn those cases into parameterized tests and cover both config shapes deterministically.
      • Option B: split them into two explicit test methods.
    • Severity: Medium, non-blocking.

Compatibility

  • Fully compatible from what I can see.
  • Existing single-field text configs still work.
  • The PR extends vectorization_fields to richer list/object forms without breaking the old string form.
  • I do not see an API, protocol, serialization, or historical-behavior break here.

Performance / side effects

  • The PR adds small per-row object allocation (SrcField) but removes the previous shared-state mutation risk.
  • The network cost increase for multimodal payloads is expected and comes from the feature itself, not from an accidental retry/logging/resource regression.
  • I do not see a new concurrency hazard on the latest head; the row-scoped clone is actually safer than the previous shared-spec behavior.

Test coverage and stability

  • Coverage: meaningful. The new UTs now cover explicit modality preservation, URL auto-detection, mixed-field text handling, and binary payload shaping.
  • Stability rating: some risk, non-blocking.
  • Reason: the new tests at the two locations above still choose one config shape randomly with ThreadLocalRandom, which makes coverage nondeterministic even though the current assertions are not obviously flaky.

Merge conclusion

Conclusion: can merge

  1. Blocking items
  • No new blocking issue from me on the latest head.
  1. Suggested follow-up
  • Replace the ThreadLocalRandom-driven test branches with deterministic parameterized or split test cases.

Overall, this revision is much cleaner than the earlier heads and the core multimodal runtime path now makes sense end to end. From the source-review side, I am okay with merging this version.

@loupipalien

Copy link
Copy Markdown
Contributor Author

@davidzollo I fixed the issue of automatic detection mentioned above

  1. Automatic detection is only performed when modalityType is not explicitly specified and payloadFormat is url.
  2. When modalityType and payloadFormat are both not specified, modalityType and payloadFormat defaults to Text, and no automatic detection is performed.

@DanielLeens

Copy link
Copy Markdown
Contributor

Thanks for the update here.

I checked the current PR head before replying, and it is still the same revision I reviewed previously (bb8677d87eabf2b8aa7bbbc7dc78bb147431acb1). So I can't confirm this fix from Daniel's side yet, because there is no new code on the branch for me to re-review.

From the description you posted, the direction sounds reasonable. Please push the actual code change to the PR branch, and once a new commit is there I will re-review the latest head end to end.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from bb8677d to 48816d7 Compare June 23, 2026 16:50

@DanielLeens DanielLeens left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed the latest head from scratch, focusing on the net changes since Daniel’s previously approved revision.

What this latest head changes

  • User pain: provider invocation settings and failure handling were still split across the embedding providers, especially in the Doubao path.
  • Fix approach: EmbeddingTransform now passes ModelInvocationOptions down into the provider models, and DoubaoModel routes both text and multimodal calls through the shared invocation runtime with configurable request timeouts and structured error mapping.
  • One-line summary: the latest delta is mainly a runtime-hardening follow-up on top of the earlier multimodal work, not a new semantic rewrite of the embedding path.

Runtime chain rechecked

SeaTunnelRow
  -> EmbeddingTransform.open() [EmbeddingTransform.java:90-170]
      -> ModelInvocationOptions.fromConfig(config)
      -> new DoubaoModel(..., invocationOptions)

vector request
  -> DoubaoModel.textVector() / multimodalVector()
      -> invocationRuntime.invoke(...)
          -> requestTimeoutMs comes from ModelInvocationContext
          -> HTTP failures -> ModelInvocationException.fromHttpStatus(...)
          -> parse failures -> ModelInvocationException.nonRetryable(...)

Key findings

  1. The normal embedding path definitely hits this change because all provider instances are created from EmbeddingTransform.open().
  2. DoubaoModel.java:139-328 now uses the shared invocation runtime instead of keeping fixed 20-second timeouts and ad-hoc exception handling inside the provider.
  3. I do not see a reopened source-level blocker in the transform path on this latest head.
  4. The current Build is red, but the failing jobs I checked are connector / JDBC integration lanes rather than the transform runtime path. transform-v2-it-part-1 is green.

Findings

Issue 1: the PR-level Build is red on the latest head

  • Location: GitHub CI Build
  • Why it matters: even though the transform path itself looks fine to me, the PR still needs a clean Build result before merge.
  • CI evidence I checked:
    • failing lanes include all-connectors-it-3, all-connectors-it-7, jdbc-connectors-it-part-5, jdbc-connectors-it-part-6, and jdbc-connectors-it-ddl
    • transform-v2-it-part-1 is passing
  • Severity: Medium

Compatibility / side effects

  • I do not see an API, protocol, serialization, or backward-compatibility break here.
  • The latest changes make the provider invocation path more uniform and improve resource handling with try (CloseableHttpResponse ...).

Test coverage / stability

  • The latest net delta does not add new test code, so there is no new test-stability blocker to call out in this round.

Merge conclusion

Conclusion: can merge after fixes

  1. Blocking items
  • Issue 1: please get the current Build green on this latest head.
  1. Suggested follow-up
  • No new source-side change is required from this review.

Overall, I do not see a new code-side blocker in the embedding runtime path. What still blocks merge for me is the red PR-level Build, not the transform logic itself.

@loupipalien loupipalien force-pushed the enhance-multimodal-embeddings branch from 48816d7 to afd0fd7 Compare June 28, 2026 17:39

@DanielLeens DanielLeens left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed the latest head from scratch against the full current diff and re-traced the real multimodal runtime path from EmbeddingTransform into SrcField, MultimodalFieldValue, and DoubaoModel.

What this PR solves

  • User pain: the transform needed a cleaner way to express mixed text / image / video inputs, and the provider invocation path still had provider-local timeout / error handling that was harder to reason about.
  • Fix approach: this revision splits config-time vector specs from row-time source fields, keeps modality detection row-local, and routes the Doubao text / multimodal calls through the shared invocation runtime.
  • One-line summary: the current head makes the multimodal embedding path much healthier, and I do not see a new source-level blocker in the latest code.

Runtime path I checked

config parsing
  -> EmbeddingTransform.initOutputFields(...) [EmbeddingTransform.java:206-237]
      -> VectorFieldSpec(fieldConfig)
      -> one or more SrcFieldSpec entries per output vector field

row-time execution
  -> EmbeddingTransform.getOutputFieldValues(...) [EmbeddingTransform.java:241-268]
      -> new SrcField(spec, rowValue) [SrcField.java:34-44]
          -> determineModalityType() [SrcField.java:46-63]
             -> keep explicit modality as source of truth
             -> only auto-detect from suffix when modality is unset and format=url
      -> wrap row-scoped multimodal inputs as MultimodalFieldValue when needed

provider call
  -> model.vectorization(...)
      -> DoubaoModel.textVector(...) / multimodalVector(...) [DoubaoModel.java:137-183]
      -> invocationRuntime.invoke(...)
      -> multimodalVectorGeneration(..., requestTimeoutMs) [DoubaoModel.java:286-323]

Key findings

  1. The normal transform path definitely hits this change, because both EmbeddingTransform.open() and getOutputFieldValues(...) are part of the main row-processing path.
  2. +1 to @davidzollo's earlier point from the older rounds about explicit modality ownership. I traced the same path on the current head, and that blocker is fixed now: SrcField.determineModalityType() returns early when the modality was explicitly configured or the payload is not URL-based, so the runtime suffix no longer overwrites explicit user config.
  3. The row-local SrcField construction also keeps runtime modality inference from leaking back into shared config state.
  4. I did not find a new source-level blocker in the current head.

Other reviewer / maintainer input

  • I noticed @davidzollo had previously raised the explicit-modality override concern. After rechecking SrcField.java:46-63 and the current tests, I agree that this point was important, and I also agree it is fixed in the latest head.

Testing / stability

  • The new source-level regression coverage is meaningful: config parsing, explicit-modality preservation, mixed text/image/video request assembly, and binary payload shaping are all covered.
  • There is still one non-blocking test-quality follow-up: ThreadLocalRandom is still used in DoubaoMultimodalModelTest.java:247 and MultimodalConfigTest.java:209, which makes branch coverage nondeterministic even though the current assertions are not obviously flaky.
  • Stability rating for the new tests: some risk, non-blocking.

CI / merge gate

  • The PR-level Build check currently still shows completed / failure on head afd0fd7b25fb4fa0fd2e1baa2ecd6e1f9545210e.
  • I followed the linked workflow metadata from that check run, and it points to loupipalien/seatunnel workflow run 28330521110, which was still in_progress when I checked.
  • The already-finished transform-relevant lanes I saw there were green, including transform-v2-it-part-1, unit-test (11, ubuntu-latest), Dead links, and Code style.
  • So from the evidence currently visible to me, I do not have a transform-path code failure to point to, but I also would not treat the PR as workflow-clean until that Build state settles.

Merge conclusion: can merge after fixes

  1. Blocking items
  • Please get the current Build signal into a stable green state before merging. If the underlying workflow finishes green but the PR check still stays red, I would treat that as a stale / sync issue and rerun or refresh the aggregate check.
  1. Suggested follow-up
  • Replace the ThreadLocalRandom-driven test branches with deterministic parameterized or split tests.

Overall, the latest revision is in much better shape from the actual embedding runtime path. I do not see a new source-side blocker on this head; what still stops me from calling it merge-ready is the unsettled PR-level Build status rather than the transform code itself.

@DanielLeens DanielLeens left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I rechecked the same head afd0fd7b25fb4fa0fd2e1baa2ecd6e1f9545210e, and the only blocker from my previous pass was the unsettled Build state rather than a transform-path code issue.

The checks on this unchanged head are green now, and my source-level conclusion is unchanged: I still do not see a blocker in the multimodal embedding runtime path I traced (EmbeddingTransform -> SrcField / MultimodalFieldValue -> DoubaoModel).

Approving from my side.

@davidzollo davidzollo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
LGTM

@davidzollo davidzollo merged commit 1863e33 into apache:dev Jun 29, 2026
5 checks passed
chl-wxp pushed a commit to chl-wxp/seatunnel-laowang that referenced this pull request Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants