[Fix][Connector-V2] Fix TiDB CDC silent row event loss when resolvedT… by zhaoysg · Pull Request #11113 · apache/seatunnel

zhaoysg · 2026-06-17T02:06:22Z

…s advances ahead of prewrite replay (#11013)

The TiDB CDC source could silently drop row change events while resolvedTs kept advancing, as reported in #11013. The root cause is in the CDC streaming reader: captureStreamingEvents advanced resolvedTs to cdcClient.getMaxResolvedTs() at the end of every batch, but the corresponding commit/prewrite pair was consumed via commits.pollFirstEntry().getValue() without verifying that the matching prewrite row had actually been buffered. When a commit row arrived before its prewrite (which happens during region split, rebalance, or under backpressure), the commit was discarded while resolvedTs was still pushed forward past the commit timestamp, causing the engine to skip subsequent real events at that timestamp without any error.

This commit fixes the issue and adds minimal diagnostic logging to help users diagnose similar scenarios in the future.

Changes

TiDBSourceReader.captureStreamingEvents / flushRows

flushRows now returns a safeResolvedTs. If a commit row has no matching prewrite row in preWrites yet, the function stops advancing resolvedTs to the commit timestamp and holds the commit in the queue, so the next batch can re-pair it with the late prewrite.
flushRows now only pollFirstEntry()s a commit row after confirming its prewrite is present, fixing the previous poll-then-leak behaviour.
The batch loop distinguishes between a "no row right now" return from cdcClient.get() and a "resolvedTs advanced but no row" case, and no longer exits early when only resolvedTs moved.

TiDBSourceReader.snapshotEvents

The snapshot scan no longer relies on a single KVClient.scan call over the whole keyRange. It now uses a RegionStoreClient per region and walks forward using region.getEndKey(), which is correct across region boundaries and stops cleanly when the scan reaches the end of the requested key range.
Adds scannedCount / emittedCount log on snapshot completion.

TableKeyRangeUtils

isRecordKey now guards against short keys (key.length > 10) to avoid ArrayIndexOutOfBoundsException on truncated payloads.
Switches from Guava ImmutableList to JDK collections to keep the module free of Guava dependencies on this hot path.

TiDBSourceSplitEnumerator

Adds structured INFO logs for enumerator open, split enumeration, and per-reader split assignment to make split distribution observable in production.

TiDBSourceReader

Adds [TiDB-CDC-DIAG] streaming stats logging (every 10s or when a single batch exceeds 1s) exposing resolvedLagMs, polled/emitted/committed counters, and pull/flush/emit latencies. This is the primary signal users can use to detect a prewrite/commit imbalance in the future.

Fixes #11013

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

If any new Jar binary package adding in your PR, please add License Notice according
New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

…s advances ahead of prewrite replay (apache#11013) The TiDB CDC source could silently drop row change events while `resolvedTs` kept advancing, as reported in apache#11013. The root cause is in the CDC streaming reader: `captureStreamingEvents` advanced `resolvedTs` to `cdcClient.getMaxResolvedTs()` at the end of every batch, but the corresponding commit/prewrite pair was consumed via `commits.pollFirstEntry().getValue()` without verifying that the matching prewrite row had actually been buffered. When a commit row arrived before its prewrite (which happens during region split, rebalance, or under backpressure), the commit was discarded while `resolvedTs` was still pushed forward past the commit timestamp, causing the engine to skip subsequent real events at that timestamp without any error. This commit fixes the issue and adds minimal diagnostic logging to help users diagnose similar scenarios in the future. Changes ------- `TiDBSourceReader.captureStreamingEvents` / `flushRows` - `flushRows` now returns a `safeResolvedTs`. If a commit row has no matching prewrite row in `preWrites` yet, the function stops advancing `resolvedTs` to the commit timestamp and holds the commit in the queue, so the next batch can re-pair it with the late prewrite. - `flushRows` now only `pollFirstEntry()`s a commit row after confirming its prewrite is present, fixing the previous poll-then-leak behaviour. - The batch loop distinguishes between a "no row right now" return from `cdcClient.get()` and a "resolvedTs advanced but no row" case, and no longer exits early when only `resolvedTs` moved. `TiDBSourceReader.snapshotEvents` - The snapshot scan no longer relies on a single `KVClient.scan` call over the whole `keyRange`. It now uses a `RegionStoreClient` per region and walks forward using `region.getEndKey()`, which is correct across region boundaries and stops cleanly when the scan reaches the end of the requested key range. - Adds `scannedCount` / `emittedCount` log on snapshot completion. `TableKeyRangeUtils` - `isRecordKey` now guards against short keys (`key.length > 10`) to avoid `ArrayIndexOutOfBoundsException` on truncated payloads. - Switches from Guava `ImmutableList` to JDK collections to keep the module free of Guava dependencies on this hot path. `TiDBSourceSplitEnumerator` - Adds structured INFO logs for enumerator open, split enumeration, and per-reader split assignment to make split distribution observable in production. `TiDBSourceReader` - Adds `[TiDB-CDC-DIAG]` streaming stats logging (every 10s or when a single batch exceeds 1s) exposing `resolvedLagMs`, polled/emitted/committed counters, and pull/flush/emit latencies. This is the primary signal users can use to detect a prewrite/commit imbalance in the future. Fixes apache#11013

…flush diagnostics This is a follow-up to apache#11013 (Fix TiDB CDC silent row event loss when resolvedTs advances ahead of prewrite replay). The original fix patched `flushRows` so that a commit row without a matching prewrite would be held back instead of being discarded with `resolvedTs` advanced past it. In production we still saw the same class of "silent event loss" in narrower windows, and there was no first-class log to correlate with the JDBC sink, so the failure mode was still hard to attribute. This commit hardens the source-side loop and adds a paired diagnostic channel on the JDBC sink so the same symptom can be observed end-to-end. Bug --- After apache#11013, the streaming loop in `captureStreamingEvents` still treated a "row == null" return as an unconditional batch boundary because the poll-attempt cap equalled `batchSize`. Concretely: 1. `cdcClient.get()` returns `null` while `getMaxResolvedTs()` has already advanced. Under region split / rebalance / backpressure the loop hits this branch many times in a row. 2. The first such return set `resolvedTsAdvances++` and continued, but the cap of `config.getBatchSize()` poll attempts meant that the next batch could still be terminated by another `null` return (the test we used on `currentMaxResolvedTs != beforeGetResolvedTs` was true the *previous* iteration, not the current one). 3. `handleRow` returned `void`, so non-record keys (index keys) were silently counted in the same `polledRows` counter as real data rows, hiding how much of the budget was actually being spent on work that emits nothing. The end result: `resolvedTs` could be pushed forward by `flushRows` without the corresponding commit/prewrite pair ever being drained, and the user had no log counter that showed the imbalance. Fix --- `TiDBSourceReader.captureStreamingEvents` - Introduce `maxPollAttempts = max(batchSize, batchSize * 10)` so the loop keeps draining metadata-driven empty polls (resolvedTs-only advances and sub-50ms returns) instead of exiting early, while still bounding worst-case work per batch. - Distinguish three cases for `cdcClient.get() == null`: (a) resolvedTs advanced since the last poll -> pure metadata, count and continue (this was already there, kept); (b) the single poll finished in < 50ms with no resolvedTs advance -> metadata-dominant batch, count and continue (new); (c) a "real" empty poll -> break. - `handleRow` now returns `boolean`; non-record keys return `false` and are counted in a new `ignoredRows` metric, while real data rows still increment `polledRows`. The streaming stats log now exposes `ignoredRows` alongside `polledRows` / `emittedRows`. `TiDBSourceReader` streaming stats log - Adds `metadataEvents`, `emptyPolls`, `pollAttempts`, and `maxPollAttempts` to the existing `[TiDB-CDC-DIAG]` log, so an operator can see at a glance whether a batch was dominated by metadata traffic (a likely companion of the original symptom). `JdbcOutputFormat` - Adds a parallel `[JDBC-SINK-DIAG]` flush-success log, emitted every 10s or whenever a single `attemptFlush()` exceeds 1s, with `records`, `totalFlushedRecords`, `flushCostMs`, and the retry `attempt` index. This gives the JDBC side the same observability that `[TiDB-CDC-DIAG]` gives the source, so the two can be correlated when investigating an end-to-end "rows missing on the sink" report. No public API, config option, or default value is changed. Verification ------------ - `./mvnw spotless:apply` on both modules - `./mvnw -q -DskipTests verify` on both modules (BUILD SUCCESS) - `./mvnw test` on both modules * connector-jdbc: 712 tests, 0 failures, 0 errors, 10 skipped * connector-cdc-tidb: 1 test, 0 failures, 0 errors Refs apache#11013 Co-authored-by: Cursor <cursoragent@cursor.com>

Empty commit to retrigger GitHub Actions on this PR after the "Workflow run detection failed" issue on the initial push. No code change. Co-authored-by: Cursor <cursoragent@cursor.com>

DanielLeens

Thanks for working on this. I reviewed the latest head from the real TiDB CDC snapshot/streaming reader path, not just from the new log lines.

What this PR fixes

User pain: when resolvedTs advances ahead of the matching prewrite replay, the old streaming path can process commits too aggressively and lose the chance to emit the corresponding row correctly. The snapshot path was also brittle around region-based scanning.
Fix approach: the PR rewrites the TiDB reader's snapshot loop, hardens the streaming poll loop, and changes flushRows(...) so an unmatched commit is held back with a lowered safeResolvedTs instead of being consumed immediately.
One-line summary: the direction is reasonable, but this is a data-correctness fix on the main CDC path, and I still need a direct regression proof for the key reordered-event scenario before I can accept it.

Actual runtime path

TiDBSourceReader.pollNext(...) [TiDBSourceReader.java:145-159]
  -> snapshotEvents(...) for INITIAL startup [162-219]
  -> captureStreamingEvents(...) [221-305]
      -> CDCClient.get()
      -> handleRow(...) [363-387]
          -> PREWRITE / COMMIT / COMMITTED stored into preWrites / commits
      -> flushRows(resolvedTs) [389-410]
          -> old path: poll commit immediately once commitTs <= resolvedTs
          -> new path: if matching prewrite is still missing, keep the commit,
             lower safeResolvedTs, and stop advancing past that point

Main findings

The core idea in flushRows(...) is correct: do not permanently consume a commit before its matching prewrite is available.
This PR is not a tiny one-line guard. It changes the snapshot loop, the streaming poll loop, the resolved-ts advancement behavior, and also adds unrelated JDBC sink diagnostics.
For a CDC data-correctness fix, the current version still misses the most important proof: a deterministic regression test that reproduces "commit visible first, prewrite arrives later, row is still emitted correctly".

Blocking issue

The PR still lacks a focused regression test for the exact reordered-event path it claims to fix.
- Location:
  TiDBSourceReader.java:221-410
- Why this matters:
  this is the real hot path for the claimed fix. captureStreamingEvents(...) -> handleRow(...) -> flushRows(...) now has different semantics around commit retention and resolved-ts advancement, but the PR does not add a reader-level or behavior-level test that proves the new logic works when the matching prewrite arrives after the commit has already been seen.
- Risk:
  this is a CDC correctness path. Without a targeted regression, we are still trusting a fairly large rewrite on reasoning alone.
- Recommended fix:
  Option A: add a focused reader/integration regression test that drives the sequence "commit first, resolvedTs advances, prewrite later" and asserts the row is emitted on the next round.
  Option B: if the current harness cannot drive the whole reader easily, at least add a smaller behavior test around handleRow(...) / flushRows(...) state transitions.
- Severity: High

Non-blocking issue
2. The JDBC sink diagnostics broaden this PR beyond the minimal TiDB CDC fix.

Location:
JdbcOutputFormat.java:45-192
Why this matters:
the PR title and the main runtime fix are about TiDB CDC source correctness, but the diff also includes JDBC sink flush diagnostics. That extra scope makes the review and CI attribution noisier than necessary.
Suggested follow-up:
ideally split the JDBC diagnostics into a separate PR, or at least explain very clearly in the PR description why they must ship together.
Severity: Medium

Tests and CI

I do not see a new targeted TiDB CDC regression test in this PR.
Build is currently red, and the failing jobs are updated-modules-integration-test-part-1 (11, ubuntu-latest) and updated-modules-integration-test-part-2 (11, ubuntu-latest).
Those are much closer to the touched modules than a random unrelated job, so without deeper log evidence I cannot safely dismiss them as unrelated noise.

Conclusion: fix required before merge

Blocking items (must fix)

Issue 1 above: please add a focused regression proof for the reordered resolvedTs / prewrite / commit path before merge.

Suggested follow-up (non-blocking)

Issue 2 above: consider splitting the JDBC sink diagnostics out of this PR.

Overall assessment

The core fix direction makes sense.
What is still missing is the proof layer. For a CDC row-loss bug, that proof should come from a targeted regression, not just from a broad code rewrite plus generic CI.
Once the regression is in place and the red Build jobs are explained, I am happy to re-review the latest head.

zhaoysg · 2026-06-18T01:29:09Z

感谢详细 review。

我已经补了一个针对 TiDB CDC 乱序事件路径的聚焦回归测试：
TiDBSourceReaderTest#flushRowsShouldHoldCommitUntilMatchingPrewriteArrives。

这个测试覆盖 review 中提到的关键状态转换：

先收到 COMMIT；
resolvedTs 已经超过 commitTs 时调用 flushRows，并验证 commit 会被保留，同时 safeResolvedTs 降到 commitTs - 1；
后续收到匹配的 PREWRITE；
再次调用 flushRows，验证 prewrite row 会被发出，并且 pending commit/prewrite 状态被清空。

本地验证：

./mvnw.cmd -pl seatunnel-connectors-v2/connector-cdc/connector-cdc-tidb test
./mvnw.cmd -q -pl seatunnel-connectors-v2/connector-cdc/connector-cdc-tidb -DskipTests verify

以上两个命令均通过。

我也尝试了全仓 ./mvnw.cmd -q -DskipTests verify，但在解析 org.apache.flink:flink-statebackend-rocksdb:1.20.1 时 Maven Central TLS handshake/download 失败，构建在到达本模块前中断。

DanielLeens

Thanks for the follow-up and for adding the focused regression tests. I re-reviewed the latest head from scratch.

Sorry for missing this in the last review: the JDBC side here is not just extra diagnostics. Looking at the current head and the still-red changed-module CI, I should treat that part as a blocker on this version as well.

The new TiDBSourceReaderTest#flushRowsShouldHoldCommitUntilMatchingPrewriteArrives and TiDBSourceSplitEnumeratorTest do close the specific proof gap I called out earlier for the reordered commit/prewrite path and the same-reader split overwrite path. So the TiDB reader fix itself is in much better shape now.

What this PR solves

Pain point: the TiDB CDC source could silently lose a row when COMMIT arrived first, resolvedTs moved past it, and the matching PREWRITE only showed up later. The enumerator also overwrote pending splits when multiple returned splits mapped to the same reader.
Approach: keep unmatched commits in flushRows(...), lower the safe resolvedTs, store pending splits as lists per reader, and add focused unit coverage.
One-line summary: the TiDB-side fix direction is right, but the current head still has one checkpoint-compatibility break and one JDBC-scope / CI blocker before merge.

Full runtime path I rechecked

TiDB source main path
  -> TiDBSourceReader.pollNext(...) [TiDBSourceReader.java:145-159]
      -> snapshotEvents(...) [162-219]
      -> captureStreamingEvents(...) [221-305]
          -> CDCClient.get()
          -> handleRow(...) [363-387]
              -> COMMIT / PREWRITE / COMMITTED buffered into commits / preWrites
          -> flushRows(resolvedTs) [389-410]
              -> if the earliest commit has no matching prewrite yet
                   keep the commit in the map
                   lower safeResolvedTs to commitTs - 1
                   stop advancing past that point
              -> otherwise poll the commit and emit the matched prewrite row

checkpoint / restore path
  -> TiDBSourceSplitEnumerator.snapshotState(...) [225-228]
      -> TiDBSourceCheckpointState.pendingSplit
      -> SeaTunnelSource.getEnumeratorStateSerializer() [SeaTunnelSource.java:120-121]
          -> DefaultSerializer.serialize(...) [DefaultSerializer.java:28-41]
  -> restore
      -> DefaultSerializer.deserialize(...)
      -> TiDBSource.restoreEnumerator(...) [TiDBSource.java:136-146]

JDBC sink path included by this PR
  -> JdbcSinkWriter / JdbcExactlyOnceSinkWriter
      -> JdbcOutputFormat.writeRecord(...) [105-118]
          -> batchSize or batchIntervalMs can now trigger flush()

Main findings

The TiDB reader now correctly covers the core reordered-event case I asked for earlier.
The new enumerator state shape is on the checkpoint / restore path, not just an in-memory refactor.
The JDBC change is not logging-only. It changes JdbcOutputFormat.writeRecord(...) behavior on the normal sink path.
The current Build check is still red, and two of the failed jobs are JDBC E2E jobs in a module touched by this PR.

Findings

Issue 1: changing `TiDBSourceCheckpointState.pendingSplit` in place breaks restore compatibility for existing checkpoints / savepoints

Location:
- seatunnel-connectors-v2/connector-cdc/connector-cdc-tidb/src/main/java/org/apache/seatunnel/connectors/seatunnel/cdc/tidb/source/enumerator/TiDBSourceCheckpointState.java:35-38
- seatunnel-api/src/main/java/org/apache/seatunnel/api/source/SeaTunnelSource.java:120-121
- seatunnel-api/src/main/java/org/apache/seatunnel/api/serialization/DefaultSerializer.java:28-41
- seatunnel-translation/seatunnel-translation-base/src/main/java/org/apache/seatunnel/translation/source/ParallelSource.java:103-106
Why this is a problem:
- pendingSplit changed from Map<Integer, TiDBSourceSplit> to Map<Integer, List<TiDBSourceSplit>>.
- This object is serialized as the enumerator checkpoint state through the default serializer path. There is no custom versioned serializer and no backward-compatibility branch on restore.
- So this is not just a runtime structure cleanup. It changes the persisted checkpoint schema for a running unbounded CDC source.
Risk:
- Existing TiDB CDC jobs restored from an older checkpoint/savepoint can fail to deserialize or otherwise fail to recover cleanly.
- That is a checkpoint-main-path compatibility break.
Best fix:
- Option A: add a versioned custom serializer for the enumerator state and explicitly migrate the old Map<Integer, TiDBSourceSplit> shape into the new list-based shape.
- Option B: keep backward-compatible restore handling for the old state layout before normalizing it.
- If you intentionally keep this as an incompatible upgrade, it also needs to be documented in both incompatible-changes docs with migration guidance.
Severity: High
Raised by others already: No

Issue 2: the JDBC side is a real sink-behavior change, not just diagnostics, and the changed-module CI is still red

Location:
- seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/JdbcOutputFormat.java:105-113
- seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/JdbcOutputFormat.java:236-239
- current failed jobs on this head:
  - updated-modules-integration-test-part-1 -> connector-jdbc-e2e-ddl -> SqlServerSchemaChangeIT ... ConditionTimeout
  - updated-modules-integration-test-part-2 -> connector-jdbc-e2e-part-1 -> JdbcDb2UpsertIT.testDb2UpsertE2e
Why this matters:
- writeRecord(...) now flushes not only on batchSize, but also on batchIntervalMs.
- That changes the normal JDBC sink execution path and retry / transaction timing. It is broader than “paired diagnostics for troubleshooting”.
- On the current head, the visible Build check is still red, and two failed jobs are JDBC E2E jobs in a module touched by this change. From the current evidence I cannot safely dismiss them as unrelated noise.
Risk:
- This can change flush frequency and transaction boundaries across JDBC sinks, including exactly-once and schema-change paths.
- Merging it together with the TiDB fix keeps the PR scope noisy and leaves the changed-module CI unresolved.
Best fix:
- Option A: split the JDBC change into a separate PR and keep this PR focused on the TiDB CDC fix.
- Option B: if the JDBC change must stay here, add focused JDBC coverage for the new batchIntervalMs path and get the JDBC E2E failures green (or provide a concrete, reproducible proof that they are unrelated).
Severity: High
Raised by others already: Partially. I mentioned the JDBC scope concern in the previous round, but I should have called out the runtime behavior change and the still-red changed-module CI more explicitly.

Tests and CI

The new TiDB unit tests are useful and stable:
- TiDBSourceReaderTest.java:42-68 is deterministic and directly covers the reordered COMMIT -> flush -> PREWRITE -> flush path.
- TiDBSourceSplitEnumeratorTest.java:45-68 is also deterministic and covers the same-reader multi-split reassignment case.
Test stability rating for the newly added UTs: Stable
- No sleep, no time-based assertions, no port allocation, no shared static state, no order-sensitive collection assertions.
Coverage gaps still left:
- no restore-compatibility test for the new enumerator checkpoint shape
- no JDBC test for the new batchIntervalMs flush path
Current visible Build failures on head 53d729c94bbff089113ebfa379be4d8957966aa1:
- connector-jdbc-e2e-ddl: SqlServerSchemaChangeIT ... ConditionTimeout
- connector-jdbc-e2e-part-1: JdbcDb2UpsertIT.testDb2UpsertE2e
- connector-cdc-mysql-e2e: MysqlCDCWithSchemaChangeIT
MySQL CDC looks unrelated to the files changed here, but the two JDBC E2E failures land on a touched module, so I would not wave those away on this version.

Merge conclusion

Conclusion: can merge after fixes

Blockers

Issue 1: please make the enumerator checkpoint state backward-compatible before merge.
Issue 2: please either split the JDBC change out, or prove it with tests and green JDBC E2E on this head.

Non-blocking suggestions

No extra non-blocking items from this round. The two issues above are already enough to block merge.

Overall, the TiDB reader fix itself is much closer now, and the new regression proof helped a lot. The remaining blockers are around upgrade safety and PR scope control, not around the original reordered-event diagnosis anymore. Happy to re-review once you push an update.

zhaoysg · 2026-06-22T02:00:21Z

Thanks for the careful re-review.

I pushed an update in fbf56053d.

Changes in this update:

Added backward-compatible normalization for TiDBSourceCheckpointState.pendingSplit so an older checkpoint state shaped as Map<Integer, TiDBSourceSplit> is restored into the new Map<Integer, List<TiDBSourceSplit>> representation.
Added a focused restore-compatibility regression test covering that legacy single-split pending map shape and verifying the restored enumerator can reassign it.
Removed the JDBC diagnostics change from this PR, so the PR diff is now focused on TiDB CDC again and no longer touches connector-jdbc relative to the PR base.

One clarification on the JDBC point: after checking the PR diff against the base, the batchIntervalMs-triggered writeRecord flush path was already present on the base branch; this PR had only added flush diagnostics in JdbcOutputFormat. Still, I agree it was unnecessary scope for this TiDB fix, so I removed it from this PR.

Local verification:

./mvnw -DspotlessFiles=seatunnel-connectors-v2/connector-cdc/connector-cdc-tidb/src/main/java/org/apache/seatunnel/connectors/seatunnel/cdc/tidb/source/enumerator/TiDBSourceCheckpointState.java,seatunnel-connectors-v2/connector-cdc/connector-cdc-tidb/src/test/java/org/apache/seatunnel/connectors/seatunnel/cdc/tidb/source/enumerator/TiDBSourceSplitEnumeratorTest.java,seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/JdbcOutputFormat.java spotless:check
./mvnw -pl seatunnel-connectors-v2/connector-cdc/connector-cdc-tidb -DskipIT -Dskip.spotless=true -Dcheckstyle.skip=true -Dtest=TiDBSourceReaderTest,TiDBSourceSplitEnumeratorTest test

The focused TiDB tests pass locally: 4 tests, 0 failures.

DanielLeens

Thanks for the update. I re-reviewed the latest head from the TiDB checkpoint-state restore path and the new regression coverage, and the exact PR scope now looks good to me.

What this PR fixes

User pain: After the TiDB CDC pending-split structure changed from a single split to a list, old checkpoint state could no longer restore cleanly.
Fix approach: The PR normalizes legacy Map<Integer, TiDBSourceSplit> state into the new Map<Integer, List<TiDBSourceSplit>> shape in the constructor, setter, and readObject() path.
One-line summary: Legacy checkpoint state now restores cleanly into the new pending-split structure.

Runtime path I traced

Restore path
  -> Java deserialization of `TiDBSourceCheckpointState`
      -> `readObject()` normalizes the pending-split map
      -> legacy single-split entries are wrapped into singleton lists

Result
  -> old checkpoint state stays readable after the internal shape change

Key findings

This change is on the restore path, which is exactly where the compatibility bug lives.
Constructor, setter, and readObject() now all normalize state the same way.
The exact PR scope is clean again on the current head.

Blocking items
None at the source-code level on the current head.

Other review notes

No additional non-blocking source-level comments from this round.

Compatibility and side effects

Compatibility: Fully backward compatible — that is the main point of this fix.
Side effects: The normalization cost is tiny and only happens on restore / state-setting paths.
Error handling / logging: Unsupported state shapes still fail fast instead of being silently swallowed.
Tests: The added regression coverage matches the legacy-single-split restore scenario that matters here.
Docs: No docs update is needed.
CI: The red checks I traced are in SQLServer/MySQL CDC schema-evolution jobs, not in the exact TiDB file scope of this PR. I would sync the latest dev and rerun CI.

Merge verdict

From the current head I do not see a source-level blocker. The legacy checkpoint-state compatibility issue looks fixed on the exact PR files.

github-actions Bot added cdc connectors-v2 labels Jun 17, 2026

github-actions Bot removed the cdc label Jun 17, 2026

ci: retrigger workflows

260bd10

Empty commit to retrigger GitHub Actions on this PR after the "Workflow run detection failed" issue on the initial push. No code change. Co-authored-by: Cursor <cursoragent@cursor.com>

DanielLeens reviewed Jun 17, 2026

View reviewed changes

davidzollo added the First-time contributor First-time contributor label Jun 17, 2026

[Test][Connector-V2] Add TiDB CDC reorder regression test

08a4ec7

[Fix][Connector-V2] Preserve TiDB CDC splits assigned to same reader

53d729c

DanielLeens suggested changes Jun 19, 2026

View reviewed changes

[Fix][Connector-V2] Keep TiDB CDC checkpoint restore compatible

fbf5605

github-actions Bot added the cdc label Jun 22, 2026

[Fix][Connector-V2] Format TiDB CDC checkpoint state

6763d58

DanielLeens approved these changes Jun 22, 2026

View reviewed changes

github-actions Bot added the reviewed label Jun 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix][Connector-V2] Fix TiDB CDC silent row event loss when resolvedT…#11113

[Fix][Connector-V2] Fix TiDB CDC silent row event loss when resolvedT…#11113
zhaoysg wants to merge 7 commits into
apache:devfrom
zhaoysg:fix/tidb-cdc-event-loss

zhaoysg commented Jun 17, 2026 •

edited

Loading

Uh oh!

DanielLeens left a comment

Uh oh!

zhaoysg commented Jun 18, 2026

Uh oh!

DanielLeens left a comment

Uh oh!

zhaoysg commented Jun 22, 2026

Uh oh!

DanielLeens left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

zhaoysg commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

Conclusion: fix required before merge

Uh oh!

zhaoysg commented Jun 18, 2026

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

What this PR solves

Full runtime path I rechecked

Main findings

Findings

Issue 1: changing TiDBSourceCheckpointState.pendingSplit in place breaks restore compatibility for existing checkpoints / savepoints

Issue 2: the JDBC side is a real sink-behavior change, not just diagnostics, and the changed-module CI is still red

Tests and CI

Merge conclusion

Conclusion: can merge after fixes

Uh oh!

zhaoysg commented Jun 22, 2026

Uh oh!

DanielLeens left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zhaoysg commented Jun 17, 2026 •

edited

Loading

Issue 1: changing `TiDBSourceCheckpointState.pendingSplit` in place breaks restore compatibility for existing checkpoints / savepoints