[Improve][Transform-V2][Embedding]Enhance multimodal embeddings by loupipalien · Pull Request #9996 · apache/seatunnel

loupipalien · 2025-10-29T18:18:58Z

Purpose of this pull request

Multi-field multimodal vectorization, doubao-embedding-vision supports multi-field multimodal mixing as input

vectorization_fields {
      multi_field_text_vector = [product_name, description]

      multi_field_image_vector = [
        {
          field = product_image_url
          modality = jpeg
          format = url
        },
        {
          field = thumbnail_image
          modality = png
          format = url
        }
      ]

      multi_field_video_vector = [
        {
          field = product_video_url
          modality = mp4
          format = url
        },
        {
          field = promotional_video
          modality = mov
          format = url
        }
      ]

      multi_field_mix_vector = [
        product_name,
        {
          field = product_image_url
          modality = jpeg
          format = url
        },
        {
          field = product_video_url
          modality = mp4
          format = url
        }
      ]
 }

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add new test cases

Check list

If any new Jar binary package adding in your PR, please add License Notice according
New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

loupipalien · 2025-11-05T13:40:05Z

@Hisoka-X @corgy-w @xiaochen-zhou help to review if have time, thanks

davidzollo

Happy new year!
+1

zhangshenghang · 2026-01-04T12:26:20Z

+        for (Map.Entry<String, Object> fieldConfig : fieldsConfig.entrySet()) {
+            VectorFieldSpec vectorFieldSpec = new VectorFieldSpec(fieldConfig);
+            log.info("Vector field spec: {}", vectorFieldSpec);
+            List<String> srcFieldNames =
+                    vectorFieldSpec.getSrcFieldSpecs().stream()
+                            .map(SrcFieldSpec::getFieldName)
+                            .collect(Collectors.toList());
+            List<Integer> srcFieldIndexes = new ArrayList<>();
+            for (String srcFieldName : srcFieldNames) {
+                try {
+                    srcFieldIndexes.add(inputRowType.indexOf(srcFieldName));
+                } catch (IllegalArgumentException e) {
+                    throw TransformCommonError.cannotFindInputFieldsError(
+                            getPluginName(), srcFieldNames);
+                }
            }
-            fieldSpecMap.put(srcFieldIndex, fieldSpec);
-            fieldNames.add(field.getKey());
+            isMultimodalFields = vectorFieldSpec.isMultimodalField();
+            fieldSpecMap.put(vectorFieldSpec, srcFieldIndexes);
+            fieldNames.add(vectorFieldSpec.getFieldName());


isMultimodalFields = vectorFieldSpec.isMultimodalField();

There is a logical issue; currently, only the last value can be obtained.

@loupipalien

Thanks for your time, let me fix it

davidzollo · 2026-03-02T13:47:10Z

Thank you for this excellent proposal! The refactoring to use VectorFieldSpec and SrcFieldSpec is a very elegant and extensible design for multimodal processing.

During the review, I found a few data-corruption bugs and missing requirements that must be addressed before this can be merged.

1. Output Field Order Mismatch (`EmbeddingTransform.java`)

Problem:
In initOutputFields(), fieldSpecMap is initialized as a HashMap. Since HashMap does not preserve insertion order, iterating over fieldSpecMap.keySet() in getOutputFieldValues() will generate the fieldValues array in a random order. This order will NOT match the order of fieldNames declared in getOutputColumns().
As a result, vectors will be written to the wrong columns downstream (e.g., text vector data inserted into the image vector column), causing severe data corruption in production.

Suggestion:
Please change it to LinkedHashMap to maintain insertion order:

// EmbeddingTransform.java
Map<VectorFieldSpec, List<Integer>> fieldSpecMap = new LinkedHashMap<>();

2. CRITICAL: Corrupted Binary payload in `toBase64` (`SrcField.java`)

Problem:
In the toBase64() method, simply calling fieldValue.toString().getBytes() is extremely dangerous for binary data. If fieldValue is a raw byte[] (which happens when reading actual binary image/video streams), Java's .toString() on a byte array returns its memory address (e.g., [B@1a2b3c), NOT the content. The model will receive garbage string hashes instead of the actual image data.

Suggestion:
Add proper type checking for byte arrays to fix the format conversion:

// SrcField.java
public String toBase64() {
    if (fieldSpec == null || !fieldSpec.isBinary()) {
        throw new IllegalArgumentException("Payload format must be binary");
    }
    if (fieldValue == null) {
        throw new IllegalArgumentException("Binary data cannot be null or empty");
    }
    
    if (fieldValue instanceof byte[]) {
        return Base64.getEncoder().encodeToString((byte[]) fieldValue);
    } else {
        return Base64.getEncoder().encodeToString(String.valueOf(fieldValue).getBytes());
    }
}

3. Missing Assertions in Tests

In DoubaoMultimodalModelTest.java, you verified Assertions.assertTrue(inputNode.has("image_url")). However, considering the Base64 bug mentioned above, please update the test cases (especially testMultimodalBodyWithBinaryImage) to strictly assert the actual Base64 string value to ensure the image sequence is encoded correctly.

4. Missing Documentation (Blocker)

According to SeaTunnel's contribution guidelines, any user-visible configuration changes must be documented. The new config capabilities (like multi_field_image_vector array configurations) are highly valuable, but they are lacking documentation.
Please add usage examples and explanations in both:

docs/en/.../embedding-v2.md or related docs
docs/zh/.../embedding-v2.md

Looking forward to your updates! Let me know if you have any questions.

loupipalien · 2026-03-14T15:51:08Z

@davidzollo Thanks for your patient guidance and valuable suggestions ❤️. Let me fix these problems.

davidzollo

+1 if CI passes.
Good job，thanks for the efforts.
By the way, Testing with ThreadLocalRandom causes non-determinism; it is recommended to change to parameterized testing.

You can feel free to add my LinkedIn https://www.linkedin.com/in/davidzollo or WeChat(davidzollo) to build a connection ^_^

DanielLeens

Thanks for the update. I re-reviewed the latest head against the full current PR diff and re-traced the actual multimodal runtime path from EmbeddingTransform into SrcField, MultimodalFieldValue, and DoubaoModel.

What this PR solves

User pain: the Embedding transform could not cleanly express or assemble mixed text / image / video inputs into one embedding request, and previous revisions could also let runtime suffix detection override an explicitly configured modality.
Fix approach: the PR splits the config/runtime responsibilities into VectorFieldSpec, SrcFieldSpec, and per-row SrcField, then builds the Doubao multimodal request body from those row-scoped source fields.
One-line summary: the latest head turns multimodal field assembly into a much healthier per-row flow, and I do not see a new source-level blocker on the current diff.

Runtime path rechecked

SeaTunnelRow enters EmbeddingTransform
  -> initOutputFields() parses vectorization_fields into VectorFieldSpec + source indexes
  -> getOutputFieldValues()
      -> build per-row SrcField objects
      -> SrcField.determineModalityType()
          -> keep explicit modality as source of truth
          -> only auto-detect from suffix when modality is unset and format=url
      -> wrap multimodal outputs as MultimodalFieldValue

Doubao request assembly
  -> DoubaoModel.multimodalVector()
      -> multimodalVectorGeneration()
          -> multimodalBody()
              -> inputRawData()
                  -> text -> {"type":"text","text":...}
                  -> image/video url -> {"type":"image_url"/"video_url", ...}
                  -> binary -> base64 data URL

Key findings

The earlier explicit-modality blocker is fixed on the current head. SrcField now clones the config spec per row and returns early in determineModalityType() when the modality was explicitly configured or the payload is not URL-based.
The normal path does hit this PR: EmbeddingTransform.getOutputFieldValues() now always builds row-scoped SrcField objects before the request is forwarded to DoubaoModel.
The new regression coverage for explicit-modality preservation and text values ending with image-like suffixes is meaningful and matches the source fix.
The current PR-level checks are green from GitHub's side (Build, Notify test workflow, labeler).

Findings

No new blocking source-level issue from me on the latest head.

There is one non-blocking test-quality point:

ThreadLocalRandom is still used in the new test code
- Location:
  - seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/embedding/DoubaoMultimodalModelTest.java:247
  - seatunnel-transforms-v2/src/test/java/org/apache/seatunnel/transform/embedding/MultimodalConfigTest.java:209
- +1 to @davidzollo's earlier point about nondeterministic test construction. The current assertions are likely stable because both branches converge to the same expected result, so I would not block merge on this.
- The downside is that each run only exercises one config shape at random, which makes branch coverage and later failure reproduction harder than necessary.
- Better fix:
  - Option A: turn those cases into parameterized tests and cover both config shapes deterministically.
  - Option B: split them into two explicit test methods.
- Severity: Medium, non-blocking.

Compatibility

Fully compatible from what I can see.
Existing single-field text configs still work.
The PR extends vectorization_fields to richer list/object forms without breaking the old string form.
I do not see an API, protocol, serialization, or historical-behavior break here.

Performance / side effects

The PR adds small per-row object allocation (SrcField) but removes the previous shared-state mutation risk.
The network cost increase for multimodal payloads is expected and comes from the feature itself, not from an accidental retry/logging/resource regression.
I do not see a new concurrency hazard on the latest head; the row-scoped clone is actually safer than the previous shared-spec behavior.

Test coverage and stability

Coverage: meaningful. The new UTs now cover explicit modality preservation, URL auto-detection, mixed-field text handling, and binary payload shaping.
Stability rating: some risk, non-blocking.
Reason: the new tests at the two locations above still choose one config shape randomly with ThreadLocalRandom, which makes coverage nondeterministic even though the current assertions are not obviously flaky.

Merge conclusion

Conclusion: can merge

Blocking items

No new blocking issue from me on the latest head.

Suggested follow-up

Replace the ThreadLocalRandom-driven test branches with deterministic parameterized or split test cases.

Overall, this revision is much cleaner than the earlier heads and the core multimodal runtime path now makes sense end to end. From the source-review side, I am okay with merging this version.

loupipalien · 2026-06-23T03:54:30Z

@davidzollo I fixed the issue of automatic detection mentioned above

Automatic detection is only performed when modalityType is not explicitly specified and payloadFormat is url.
When modalityType and payloadFormat are both not specified, modalityType and payloadFormat defaults to Text, and no automatic detection is performed.

DanielLeens · 2026-06-23T11:02:21Z

Thanks for the update here.

I checked the current PR head before replying, and it is still the same revision I reviewed previously (bb8677d87eabf2b8aa7bbbc7dc78bb147431acb1). So I can't confirm this fix from Daniel's side yet, because there is no new code on the branch for me to re-review.

From the description you posted, the direction sounds reasonable. Please push the actual code change to the PR branch, and once a new commit is there I will re-review the latest head end to end.

DanielLeens

Thanks for the update. I re-reviewed the latest head from scratch, focusing on the net changes since Daniel’s previously approved revision.

What this latest head changes

User pain: provider invocation settings and failure handling were still split across the embedding providers, especially in the Doubao path.
Fix approach: EmbeddingTransform now passes ModelInvocationOptions down into the provider models, and DoubaoModel routes both text and multimodal calls through the shared invocation runtime with configurable request timeouts and structured error mapping.
One-line summary: the latest delta is mainly a runtime-hardening follow-up on top of the earlier multimodal work, not a new semantic rewrite of the embedding path.

Runtime chain rechecked

SeaTunnelRow
  -> EmbeddingTransform.open() [EmbeddingTransform.java:90-170]
      -> ModelInvocationOptions.fromConfig(config)
      -> new DoubaoModel(..., invocationOptions)

vector request
  -> DoubaoModel.textVector() / multimodalVector()
      -> invocationRuntime.invoke(...)
          -> requestTimeoutMs comes from ModelInvocationContext
          -> HTTP failures -> ModelInvocationException.fromHttpStatus(...)
          -> parse failures -> ModelInvocationException.nonRetryable(...)

Key findings

The normal embedding path definitely hits this change because all provider instances are created from EmbeddingTransform.open().
DoubaoModel.java:139-328 now uses the shared invocation runtime instead of keeping fixed 20-second timeouts and ad-hoc exception handling inside the provider.
I do not see a reopened source-level blocker in the transform path on this latest head.
The current Build is red, but the failing jobs I checked are connector / JDBC integration lanes rather than the transform runtime path. transform-v2-it-part-1 is green.

Findings

Issue 1: the PR-level Build is red on the latest head

Location: GitHub CI Build
Why it matters: even though the transform path itself looks fine to me, the PR still needs a clean Build result before merge.
CI evidence I checked:
- failing lanes include all-connectors-it-3, all-connectors-it-7, jdbc-connectors-it-part-5, jdbc-connectors-it-part-6, and jdbc-connectors-it-ddl
- transform-v2-it-part-1 is passing
Severity: Medium

Compatibility / side effects

I do not see an API, protocol, serialization, or backward-compatibility break here.
The latest changes make the provider invocation path more uniform and improve resource handling with try (CloseableHttpResponse ...).

Test coverage / stability

The latest net delta does not add new test code, so there is no new test-stability blocker to call out in this round.

Merge conclusion

Conclusion: can merge after fixes

Blocking items

Issue 1: please get the current Build green on this latest head.

Suggested follow-up

No new source-side change is required from this review.

Overall, I do not see a new code-side blocker in the embedding runtime path. What still blocks merge for me is the red PR-level Build, not the transform logic itself.

…nd format is url

DanielLeens

Thanks for the update. I re-reviewed the latest head from scratch against the full current diff and re-traced the real multimodal runtime path from EmbeddingTransform into SrcField, MultimodalFieldValue, and DoubaoModel.

What this PR solves

User pain: the transform needed a cleaner way to express mixed text / image / video inputs, and the provider invocation path still had provider-local timeout / error handling that was harder to reason about.
Fix approach: this revision splits config-time vector specs from row-time source fields, keeps modality detection row-local, and routes the Doubao text / multimodal calls through the shared invocation runtime.
One-line summary: the current head makes the multimodal embedding path much healthier, and I do not see a new source-level blocker in the latest code.

Runtime path I checked

config parsing
  -> EmbeddingTransform.initOutputFields(...) [EmbeddingTransform.java:206-237]
      -> VectorFieldSpec(fieldConfig)
      -> one or more SrcFieldSpec entries per output vector field

row-time execution
  -> EmbeddingTransform.getOutputFieldValues(...) [EmbeddingTransform.java:241-268]
      -> new SrcField(spec, rowValue) [SrcField.java:34-44]
          -> determineModalityType() [SrcField.java:46-63]
             -> keep explicit modality as source of truth
             -> only auto-detect from suffix when modality is unset and format=url
      -> wrap row-scoped multimodal inputs as MultimodalFieldValue when needed

provider call
  -> model.vectorization(...)
      -> DoubaoModel.textVector(...) / multimodalVector(...) [DoubaoModel.java:137-183]
      -> invocationRuntime.invoke(...)
      -> multimodalVectorGeneration(..., requestTimeoutMs) [DoubaoModel.java:286-323]

Key findings

The normal transform path definitely hits this change, because both EmbeddingTransform.open() and getOutputFieldValues(...) are part of the main row-processing path.
+1 to @davidzollo's earlier point from the older rounds about explicit modality ownership. I traced the same path on the current head, and that blocker is fixed now: SrcField.determineModalityType() returns early when the modality was explicitly configured or the payload is not URL-based, so the runtime suffix no longer overwrites explicit user config.
The row-local SrcField construction also keeps runtime modality inference from leaking back into shared config state.
I did not find a new source-level blocker in the current head.

Other reviewer / maintainer input

I noticed @davidzollo had previously raised the explicit-modality override concern. After rechecking SrcField.java:46-63 and the current tests, I agree that this point was important, and I also agree it is fixed in the latest head.

Testing / stability

The new source-level regression coverage is meaningful: config parsing, explicit-modality preservation, mixed text/image/video request assembly, and binary payload shaping are all covered.
There is still one non-blocking test-quality follow-up: ThreadLocalRandom is still used in DoubaoMultimodalModelTest.java:247 and MultimodalConfigTest.java:209, which makes branch coverage nondeterministic even though the current assertions are not obviously flaky.
Stability rating for the new tests: some risk, non-blocking.

CI / merge gate

The PR-level Build check currently still shows completed / failure on head afd0fd7b25fb4fa0fd2e1baa2ecd6e1f9545210e.
I followed the linked workflow metadata from that check run, and it points to loupipalien/seatunnel workflow run 28330521110, which was still in_progress when I checked.
The already-finished transform-relevant lanes I saw there were green, including transform-v2-it-part-1, unit-test (11, ubuntu-latest), Dead links, and Code style.
So from the evidence currently visible to me, I do not have a transform-path code failure to point to, but I also would not treat the PR as workflow-clean until that Build state settles.

Merge conclusion: can merge after fixes

Blocking items

Please get the current Build signal into a stable green state before merging. If the underlying workflow finishes green but the PR check still stays red, I would treat that as a stale / sync issue and rerun or refresh the aggregate check.

Suggested follow-up

Replace the ThreadLocalRandom-driven test branches with deterministic parameterized or split tests.

Overall, the latest revision is in much better shape from the actual embedding runtime path. I do not see a new source-side blocker on this head; what still stops me from calling it merge-ready is the unsettled PR-level Build status rather than the transform code itself.

DanielLeens

Thanks for the update. I rechecked the same head afd0fd7b25fb4fa0fd2e1baa2ecd6e1f9545210e, and the only blocker from my previous pass was the unsettled Build state rather than a transform-path code issue.

The checks on this unchanged head are green now, and my source-level conclusion is unchanged: I still do not see a blocker in the multimodal embedding runtime path I traced (EmbeddingTransform -> SrcField / MultimodalFieldValue -> DoubaoModel).

Approving from my side.

davidzollo

+1
LGTM

…he#9996)

github-actions Bot added Transform-v2 e2e labels Oct 29, 2025

loupipalien changed the title ~~Enhance multimodal embeddings~~ [Improve][Transform-V2][Embedding]Enhance multimodal embeddings Oct 30, 2025

loupipalien force-pushed the enhance-multimodal-embeddings branch from 2f8b47b to bdbd012 Compare November 4, 2025 14:46

davidzollo previously approved these changes Jan 1, 2026

View reviewed changes

github-actions Bot added approved reviewed labels Jan 1, 2026

zhangshenghang reviewed Jan 4, 2026

View reviewed changes

loupipalien dismissed davidzollo’s stale review via d497cba February 13, 2026 16:56

loupipalien force-pushed the enhance-multimodal-embeddings branch from bdbd012 to d497cba Compare February 13, 2026 16:56

github-actions Bot removed approved reviewed labels Feb 13, 2026

davidzollo requested a review from zhangshenghang February 15, 2026 14:00

loupipalien force-pushed the enhance-multimodal-embeddings branch from 0c8a06b to 5f5745e Compare March 1, 2026 08:52

loupipalien force-pushed the enhance-multimodal-embeddings branch from 5f5745e to a54db18 Compare March 10, 2026 14:05

github-actions Bot added the CI&CD label Mar 12, 2026

loupipalien requested review from corgy-w and davidzollo March 14, 2026 04:13

loupipalien force-pushed the enhance-multimodal-embeddings branch from b67b4d4 to 568fab0 Compare March 14, 2026 15:56

github-actions Bot added the document label Mar 14, 2026

davidzollo previously approved these changes Mar 15, 2026

View reviewed changes

github-actions Bot added approved reviewed labels Mar 15, 2026

loupipalien force-pushed the enhance-multimodal-embeddings branch from 568fab0 to 59cecda Compare March 21, 2026 15:50

loupipalien dismissed davidzollo’s stale review via d7bcb01 March 23, 2026 19:15

loupipalien force-pushed the enhance-multimodal-embeddings branch from 59cecda to d7bcb01 Compare March 23, 2026 19:15

github-actions Bot removed the approved label Mar 23, 2026

DanielLeens approved these changes Jun 20, 2026

View reviewed changes

loupipalien force-pushed the enhance-multimodal-embeddings branch from bb8677d to 48816d7 Compare June 23, 2026 16:50

DanielLeens reviewed Jun 24, 2026

View reviewed changes

loupipalien added 19 commits June 29, 2026 01:38

enhance multimodal embeddings

2c285d6

fix: fix VectorFieldSpecTest test case

5af23f6

chore: add spotless toggleOffOn

7c994be

chore: code format

593fef9

chore:remove deduplicated code

4e9f707

chore: modify comment

1209fdb

fix: fix set isMultimodalFileds

85f2dbe

test: upgrade testcontainers for docker test

f2703aa

chore: add log

e962868

fix: use reference filter instead of image name filter

a5da9f2

chore: increase connector-file-local-it job timeout

a31cac3

fix: fix some bugs and add docs

d3fdca9

chore: increase all-connectors-it-1 job timeout

ecac9fb

chore: update docs and remove unnecessary comment

10856f4

modify: create a new object avoid to share SrcFieldSpec

55d9a91

chore: revert testcontainer version and ci timeout minutes

b333d2b

chore: revert to image name filter

d3ca789

chore: remove spotless toggleOffOn

5724f00

fix: only detect modality when modality not explicitly configurated a…

afd0fd7

…nd format is url

loupipalien force-pushed the enhance-multimodal-embeddings branch from 48816d7 to afd0fd7 Compare June 28, 2026 17:39

DanielLeens reviewed Jun 29, 2026

View reviewed changes

DanielLeens approved these changes Jun 29, 2026

View reviewed changes

davidzollo approved these changes Jun 29, 2026

View reviewed changes

davidzollo merged commit 1863e33 into apache:dev Jun 29, 2026
5 checks passed

chl-wxp pushed a commit to chl-wxp/seatunnel-laowang that referenced this pull request Jul 1, 2026

[Improve][Transform-V2][Embedding]Enhance multimodal embeddings (apac…

57c3d82

…he#9996)

Uh oh!

Conversation

loupipalien commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Uh oh!

loupipalien commented Nov 5, 2025

Uh oh!

davidzollo left a comment

Choose a reason for hiding this comment

Uh oh!

zhangshenghang Jan 4, 2026

Choose a reason for hiding this comment

Uh oh!

corgy-w Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

loupipalien Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

davidzollo commented Mar 2, 2026

1. Output Field Order Mismatch (EmbeddingTransform.java)

2. CRITICAL: Corrupted Binary payload in toBase64 (SrcField.java)

3. Missing Assertions in Tests

4. Missing Documentation (Blocker)

Uh oh!

loupipalien commented Mar 14, 2026

Uh oh!

davidzollo left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

What this PR solves

Runtime path rechecked

Key findings

Findings

Compatibility

Performance / side effects

Test coverage and stability

Merge conclusion

Conclusion: can merge

Uh oh!

loupipalien commented Jun 23, 2026

Uh oh!

DanielLeens commented Jun 23, 2026

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

What this latest head changes

Runtime chain rechecked

Key findings

Findings

Compatibility / side effects

Test coverage / stability

Merge conclusion

Conclusion: can merge after fixes

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

Merge conclusion: can merge after fixes

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

Uh oh!

davidzollo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

loupipalien commented Oct 29, 2025 •

edited

Loading

1. Output Field Order Mismatch (`EmbeddingTransform.java`)

2. CRITICAL: Corrupted Binary payload in `toBase64` (`SrcField.java`)

davidzollo left a comment •

edited

Loading