[backport] Backport sweep for 9.1 by valkeyrie-ops[bot] · Pull Request #3987 · valkey-io/valkey

valkeyrie-ops · 2026-06-15T16:14:38Z

Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

Applied

Source PR	Title	Detail
#3938	Fix IO-Threads redesign cleanup perf regression from #3544
#3964	Omit alldbs rule in ACL SAVE/LIST and CONFIG REWRITE for compatibility	conflicts resolved by Claude Code
#3920	Reject integer overflow of length fields in zipmapValidateIntegrity
#3921	Reject NAN scores in listpack/ziplist-encoded sorted sets on RDB load
#3959	Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS
#3939	Fix RESP3 type violation in addReplyCommandSubCommands
#3811	Fix off_t to int truncation in bio repl transfer size reporting
#3843	Increase max proctitle length from 255 to 1024
#3980	Reduce io-threads modifiability test iterations under Valgrind
#4049	Fix HGETDEL FIELDS token validation
#4016	Fix build warnings with OpenSSL 4.0	conflicts resolved by Claude Code
#4047	Fix HRANDFIELD CASE 4 infinite loop when valid fields fewer than count
#4024	Restore client's selected DB after module keyspace notification
#4073	fix: Reject corrupt stream RDB with shared NACK across consumers
#4068	Fix sentinel failover coordinated segfault when old leader's client is disconnected

AI resolution details are posted as comments on this PR when available.

Generated by valkey-ci-agent using Claude Code.

This regression is still present in 9.1 GA as the cherrypick of revert commit was missed during release. Re-applies #3544 (reverted in #3756 due to ~20% SET regression) with the performance fix from #3760. **Root Cause:** The original #3544 changed `tryOffloadFreeObjToIOThreads` to only offload the SDS buffer free to IO threads, freeing the `robj` shell on the main thread. I carried out profiling for the change and it showed that freeing the `robj` shell on the main thread became the prime main-thread hotspot (~10% CPU), while IO threads shifted from doing real `jemalloc` work to spinning idle on `spmcDequeue`. **Fix**: Keep `tryOffloadFreeObjToIOThreads` offloading the entire robj (`decrRefCount`) to the IO thread. Cross-thread `zfree` is safe with `jemalloc`. This PR includes all cleanup work from #3544 so - - `trySendWriteToIOThreads`: defer clearing `last_header` until after successful enqueue - `evictClients`: simplified bookkeeping - Queue sizes as runtime parameters instead of compile-time macros - IO ignition policy using `stat_active_time` instead of `getrusage` - Function renames (`IOThreadFreeArgv` --> `ioThreadFreeArgv`, etc.) and doc comments **Benchmark** on (Graviton4 c8gb.metal-48xl): Config: SET, 128B values, 9 IO threads, pipeline=10, 1600 clients - Same as Valkey official method | Version | Throughput | |---------|-----------| | Unstable + original #3544 | ~1,554K rps | | Unstable + this PR | ~2,116K rps | <details> <summary>Diff vs original #3544 (perf fix)</summary> ```diff diff --git a/src/io_threads.c b/src/io_threads.c --- a/src/io_threads.c +++ b/src/io_threads.c @@ // IO thread handler case JOB_REQ_FREE_OBJ: - zfree(data); + decrRefCount(data); break; @@ // tryOffloadFreeObjToIOThreads - /* We offload only the free of the ptr that may be allocated by the I/O thread. - * The object itself was allocated by the main thread and will be freed by the main thread. */ - void *job = tagJob(sdsAllocPtr(objectGetVal(obj)), JOB_REQ_FREE_OBJ); + void *job = tagJob(obj, JOB_REQ_FREE_OBJ); if (unlikely(spmcEnqueue(&io_shared_inbox, job) == false)) return C_ERR; - objectSetVal(obj, NULL); - decrRefCount(obj); io_jobs_submitted++; ``` </details> --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

#3964) Database-level ACL #2309 introduced `alldbs` rule that was explicit for all users and because of that previous versions no longer had the ability to parse ACL strings produced by later versions. Omit `alldbs` in `ACLDescribeSelector()`, that is used in `ACL SAVE/LOAD` and `CONFIG REWRITE` command paths so that downgrades would be possible if new feature was not used (`db=` and `resetdbs` rules). Keep `ACL GETUSER` command's output as is and return `alldbs` in `databases` field because of command's field-value format. Add test to check that `ACL LIST` omits implicit `alldbs` and add check to existing `ACL SAVE` and `CONFIG REWRITE` tests. Fixes #3915 Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

codecov · 2026-06-15T16:39:29Z

Codecov Report

❌ Patch coverage is 94.39252% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.47%. Comparing base (71545e0) to head (cfc72d4).

Files with missing lines	Patch %	Lines
src/module.c	0.00%	3 Missing ⚠️
src/io_threads.c	94.44%	1 Missing ⚠️
src/sentinel.c	0.00%	1 Missing ⚠️
src/server.c	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              9.1    #3987      +/-   ##
==========================================
- Coverage   76.73%   76.47%   -0.26%     
==========================================
  Files         163      163              
  Lines       81098    81129      +31     
==========================================
- Hits        62228    62046     -182     
- Misses      18870    19083     +213

Files with missing lines	Coverage Δ
src/acl.c	`92.55% <ø> (-0.13%)`	⬇️
src/networking.c	`92.05% <100.00%> (-0.18%)`	⬇️
src/queues.c	`99.35% <100.00%> (+0.02%)`	⬆️
src/rdb.c	`77.01% <100.00%> (+0.08%)`	⬆️
src/replication.c	`86.15% <100.00%> (+0.22%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/setproctitle.c	`76.19% <ø> (ø)`
src/t_hash.c	`95.28% <100.00%> (+0.01%)`	⬆️
src/t_zset.c	`96.86% <100.00%> (-0.09%)`	⬇️
src/tls.c	`17.64% <ø> (ø)`
... and 6 more

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…3920) ## Problem A crafted zipmap entry can set the value length to a value near `UINT32_MAX` so that the `l + e` sum (value length + one-byte free space) wraps in `unsigned int` arithmetic. The wrapped sum advances the validation cursor by a tiny amount, leaving `p` inside the buffer, so the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly returns success. The field-length path has the same shape — advancing `p` by a ~4GB length wraps the pointer on 32-bit builds. `zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c` when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any client with `RESTORE` access can submit a payload that passes validation. On 32-bit platforms this leads to out-of-bounds access during the subsequent zipmap→listpack conversion. On 64-bit the downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept a malformed payload in the first place — this is the function whose sole job is to reject it. ## Fix Bounds-check the attacker-controlled length against the bytes remaining in the zipmap, in 64-bit space, **before** any pointer arithmetic, for both the field-length and value-length paths. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising the full attack surface; asserts rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (validator accepts the value-length payload) and **passes after the fix**, confirmed by stashing the fix during the integration run. - Full `integration/corrupt-dump` suite: 76 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

…#3921) ## Problem A crafted `RESTORE` payload can store a `NAN` score in a listpack-encoded sorted set. The integrity validation (`lpValidateIntegrityAndDups`) only checks the listpack *structure* and member uniqueness — it does not check score validity — so the payload is accepted on load. When the sorted set is later converted to a skiplist (e.g. when it grows past `zset-max-listpack-entries`, or via any operation that triggers conversion), `zslInsertNode()` asserts the score is not `NAN` (`t_zset.c:260`) and the server aborts. **Any client with `RESTORE` access can remotely crash the server.** The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no equivalent check. ## Reproduction ``` RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...." # loads OK, then: ZADD k 9 x # forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT ``` ## Fix Add `zzlValidateScores()`, which scans the scores of a listpack zset after structural validation and rejects the payload if any score is `NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and large finite scores remain accepted (only `NAN` is rejected), matching normal `ZADD` semantics. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (server crashes on conversion) and **passes after the fix**, by stashing the fix during the run. - Confirmed valid zsets, including `inf`/`-inf`/large finite scores, still load and convert correctly. - Full `integration/corrupt-dump` suite: 74 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`. The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See: ``` if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch || clusterSlotFailoverGranted(j)) { ... clusterDelSlot(j); clusterAddSlot(sender, j); ... } ``` R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]: ``` if (server.cluster->slots[j] == sender) { /* The slot is currently bound to the sender but the sender is no longer * claiming it. We don't want to unbind the slot yet as it can cause the cluster * to move to FAIL state and also throw client error. Keeping the slot bound to * the previous owner will cause a few client side redirects, but won't throw * any errors. We will keep track of the uncertainty in ownership to avoid * propagating misinformation about this slot's ownership using UPDATE * messages. */ bitmapSetBit(server.cluster->owner_not_claiming_slot, j); } ``` Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever. ``` if (server.cluster->slots[j] == NULL || ...) { new_state = CLUSTER_FAIL; ... } ``` Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify. This closes #3891. The test was added in #3674. Signed-off-by: Binbin <binloveplay1314@qq.com>

## Summary `addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)` when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`) regardless of the `use_map` parameter. The non-empty path (below it) already branches correctly on `use_map` — the empty early-return was simply missing the same logic. In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`) instead of an Array (`*0`) for any command without subcommands (e.g. PING). Strict RESP3 client libraries that dispatch on collection type will misinterpret the response. Not visible in RESP2 since both Set and Array use the `*` prefix there. ## Fix Apply the same `use_map` branch to the empty case: - `addReplyMapLen(c, 0)` when `use_map=1` - `addReplyArrayLen(c, 0)` otherwise ## Test Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that inspects the raw wire-level type byte for the subcommands field of `COMMAND INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set). Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com> Signed-off-by: rickrams <rickrams@amazon.com>

`off_t` (64-bit), but were read into `int` (32-bit) locals in `genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`. This causes INFO replication to report negative `master_sync_total_bytes` during bio disk-based sync when RDB exceeds 2GB. Fix: change the local variable types from `int` to `off_t`. Signed-off-by: chx9 <lovelypiska@outlook.com>

255 is too short if valkey-server is being run from a long path, especially if many fields are included such as in the "Process title set as expected" test in `unit/other.tcl`, where the max length for the cmdline is 1024, so having the same number in both makes sense. Closes #3832. Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>

The `test io-threads are runtime modifiable` test in `tests/unit/other.tcl` times out on the dedicated Valgrind jobs of the daily CI, failing the run. The failing test is introduced in #3938. This PR reduces the loop to 10 iterations under Valgrind. **Failure links:** - https://github.com/valkey-io/valkey/actions/runs/27386948127 (Jun 12) - https://github.com/valkey-io/valkey/actions/runs/27315974006 (Jun 11) - https://github.com/valkey-io/valkey/actions/runs/27245311034 (Jun 10) --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Add explicit validation for the FIELDS token before parsing numfields. Fixes #4045. Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

Some functions are newly deprecated in OpenSSL 4.0. This commit works around those. More specifically: - X509_cmp_current_time() in isCertValid() is no longer necessary as X509_check_certificate_times() compares the notBefore and notAfter on its own. Had to add a version check since this function is new in OpenSSL 4.0. - Replace deprecated X509_NAME_get_text_by_NID. Not a perfect fix, because the new implementation still assumes that the name does not contain embedded null characters which may not be true, e.g., if the name is of type UniversalString or BMPString. - Also fix constness of X509_get_subject_name return value. The latter two fixes are taken from this Fedora patch: https://src.fedoraproject.org/rpms/valkey/c/1a9c8847172ef3fb116a1e2fdb3871692378adae?branch=rawhide Closes #4012 Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>

valkeyrie-ops · 2026-06-27T09:48:57Z

AI conflict resolution: source PR #4016

Fix build warnings with OpenSSL 4.0

1 conflicted file

Claude Summary

Resolution complete. Brace balance is even (250/250), confirming the file is structurally intact.

Summary of what I did:

The cherry-pick of PR #4016 ("Fix build warnings with OpenSSL 4.0") had a single conflict region in src/tls.c (between tlsPasswordCallback and createSSLContext):

HEAD (9.1) side: empty

Incoming side: three complete functions — isCertValid, loadCaCertDir, areAllCaCertsValid

The PR's actual diff only modifies isCertValid() (adding an OpenSSL 4.0 #if branch) and getCertSubjectFieldByName(). It does not create those three functions — they pre-existed on the source branch as part of a separate CA-cert-validation feature that does not exist on 9.1 (I verified all three are absent and referenced nowhere in the 9.1 tree). The git 3-way merge presented the whole functions as "incoming" only because the PR touched lines inside one of them.

So I took the empty HEAD side, dropping the entire incoming block and the conflict markers. Pulling in those ~94 lines would have added unrelated functionality the PR never introduced and that nothing on 9.1 calls.

The PR's other hunk — the getCertSubjectFieldByName() rework (the outlen == 0 guard, the const fix, and the replacement of deprecated X509_NAME_get_text_by_NID) — applied cleanly during the cherry-pick and is preserved. That is the one logical change from this PR that is applicable to 9.1, and git diff HEAD confirms it's the sole remaining change.

No conflict markers remain. I did not run git add or git commit, and did not touch any other files.

AI-resolved conflicted files

src/tls.c — view diff

Full backport commit diff: commit d9c38d207f09.

Please review these AI resolutions for correctness before merging.

#4047) ### Problem `HRANDFIELD` with a positive `count` can enter an infinite loop (CASE 4) when the number of non-expired fields in a hash is less than `count`. **Example:** A hash has 100 fields total: 71 with short TTLs (already expired) and 29 persistent fields. A client runs: ``` HRANDFIELD myhash 30 ``` Since `count * 3 = 90 < 100 = size`, CASE 4 is selected. The loop calls `hashTypeRandomElement()`, which skips expired fields (via `validateEntry`), so it always returns one of the 29 valid fields. `hashTypeRandomElement()` retries up to 100 times internally to find a non-expired field. However, as long as any valid fields exist, this retry limit is rarely exhausted. After collecting all 29 unique valid fields, every subsequent `hashtableAdd()` returns false (duplicate), but `added = 29 < count = 30`, so the loop never breaks. This is reproducible when active expiry hasn't cleaned up the expired fields yet (e.g. on a replica, or during a burst of expirations). ### Fix Add a `maxiter = count * 10` limit to the CASE 4 loop. When the iteration budget is exhausted, return whatever unique fields have been collected so far — returning fewer results is acceptable. Signed-off-by: cjx-zar <jxchenczar@foxmail.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

moduleNotifyKeyspaceEvent() runs subscriber callbacks on the executing client and calls selectDb(dbid) on it, but never restores the DB. When dbid differs from the client's current DB (e.g. MOVE/COPY notify on the destination DB), the client is left on the wrong DB, breaking subsequent commands. This was introduced by #1819, which started reusing the executing client instead of a temporary one. Save the executing client's DB before the subscriber loop and restore it afterwards. Covers both MOVE and COPY. Signed-off-by: Binbin <binloveplay1314@qq.com>

When loading a stream consumer group from RDB, the loader assigns each NACK's consumer pointer without checking if it was already set. A corrupt RDB payload can list the same message ID under multiple consumers' PELs, resulting in a shared NACK. This leads to use-after-free when one consumer is deleted while another still references the same NACK. Add a check that nack->consumer is NULL before assigning. If already set, reject the RDB as corrupt. Signed-off-by: Saurabh Kher <saurabh@amazon.com>

…s disconnected (#4068) Prevent a segfault that can happen during a sentinel coordinated failover (a new feature introduced in #1292) that can happen when the coordinated failover attempting to clean up the former leader races with that leader disconnecting. Add a check for `link->cc` in `sentinelKillClients`. Otherwise the sentinel node will segfault in this case. Fixes #4066. Signed-off-by: Luke Palmer <luke@lukepalmer.net> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

roshkhatri and others added 2 commits June 15, 2026 16:09

madolson and others added 9 commits June 16, 2026 10:36

Fix HGETDEL FIELDS token validation (#4049)

9b4767b

Add explicit validation for the FIELDS token before parsing numfields. Fixes #4045. Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

cjx-zar and others added 4 commits June 30, 2026 10:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[backport] Backport sweep for 9.1#3987

[backport] Backport sweep for 9.1#3987
valkeyrie-ops[bot] wants to merge 15 commits into
9.1from
agent/backport/sweep/9.1

valkeyrie-ops Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

valkeyrie-ops Bot commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

Uh oh!

Conversation

valkeyrie-ops Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Backport sweep for 9.1

Applied

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

valkeyrie-ops Bot commented Jun 27, 2026

AI conflict resolution: source PR #4016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

valkeyrie-ops Bot commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading