[backport] Backport sweep for 9.1#3987
Conversation
This regression is still present in 9.1 GA as the cherrypick of revert commit was missed during release. Re-applies #3544 (reverted in #3756 due to ~20% SET regression) with the performance fix from #3760. **Root Cause:** The original #3544 changed `tryOffloadFreeObjToIOThreads` to only offload the SDS buffer free to IO threads, freeing the `robj` shell on the main thread. I carried out profiling for the change and it showed that freeing the `robj` shell on the main thread became the prime main-thread hotspot (~10% CPU), while IO threads shifted from doing real `jemalloc` work to spinning idle on `spmcDequeue`. **Fix**: Keep `tryOffloadFreeObjToIOThreads` offloading the entire robj (`decrRefCount`) to the IO thread. Cross-thread `zfree` is safe with `jemalloc`. This PR includes all cleanup work from #3544 so - - `trySendWriteToIOThreads`: defer clearing `last_header` until after successful enqueue - `evictClients`: simplified bookkeeping - Queue sizes as runtime parameters instead of compile-time macros - IO ignition policy using `stat_active_time` instead of `getrusage` - Function renames (`IOThreadFreeArgv` --> `ioThreadFreeArgv`, etc.) and doc comments **Benchmark** on (Graviton4 c8gb.metal-48xl): Config: SET, 128B values, 9 IO threads, pipeline=10, 1600 clients - Same as Valkey official method | Version | Throughput | |---------|-----------| | Unstable + original #3544 | ~1,554K rps | | Unstable + this PR | ~2,116K rps | <details> <summary>Diff vs original #3544 (perf fix)</summary> ```diff diff --git a/src/io_threads.c b/src/io_threads.c --- a/src/io_threads.c +++ b/src/io_threads.c @@ // IO thread handler case JOB_REQ_FREE_OBJ: - zfree(data); + decrRefCount(data); break; @@ // tryOffloadFreeObjToIOThreads - /* We offload only the free of the ptr that may be allocated by the I/O thread. - * The object itself was allocated by the main thread and will be freed by the main thread. */ - void *job = tagJob(sdsAllocPtr(objectGetVal(obj)), JOB_REQ_FREE_OBJ); + void *job = tagJob(obj, JOB_REQ_FREE_OBJ); if (unlikely(spmcEnqueue(&io_shared_inbox, job) == false)) return C_ERR; - objectSetVal(obj, NULL); - decrRefCount(obj); io_jobs_submitted++; ``` </details> --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
#3964) Database-level ACL #2309 introduced `alldbs` rule that was explicit for all users and because of that previous versions no longer had the ability to parse ACL strings produced by later versions. Omit `alldbs` in `ACLDescribeSelector()`, that is used in `ACL SAVE/LOAD` and `CONFIG REWRITE` command paths so that downgrades would be possible if new feature was not used (`db=` and `resetdbs` rules). Keep `ACL GETUSER` command's output as is and return `alldbs` in `databases` field because of command's field-value format. Add test to check that `ACL LIST` omits implicit `alldbs` and add check to existing `ACL SAVE` and `CONFIG REWRITE` tests. Fixes #3915 Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 9.1 #3987 +/- ##
==========================================
- Coverage 76.73% 76.47% -0.26%
==========================================
Files 163 163
Lines 81098 81129 +31
==========================================
- Hits 62228 62046 -182
- Misses 18870 19083 +213
🚀 New features to boost your workflow:
|
…3920) ## Problem A crafted zipmap entry can set the value length to a value near `UINT32_MAX` so that the `l + e` sum (value length + one-byte free space) wraps in `unsigned int` arithmetic. The wrapped sum advances the validation cursor by a tiny amount, leaving `p` inside the buffer, so the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly returns success. The field-length path has the same shape — advancing `p` by a ~4GB length wraps the pointer on 32-bit builds. `zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c` when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any client with `RESTORE` access can submit a payload that passes validation. On 32-bit platforms this leads to out-of-bounds access during the subsequent zipmap→listpack conversion. On 64-bit the downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept a malformed payload in the first place — this is the function whose sole job is to reject it. ## Fix Bounds-check the attacker-controlled length against the bytes remaining in the zipmap, in 64-bit space, **before** any pointer arithmetic, for both the field-length and value-length paths. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising the full attack surface; asserts rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (validator accepts the value-length payload) and **passes after the fix**, confirmed by stashing the fix during the integration run. - Full `integration/corrupt-dump` suite: 76 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
…#3921) ## Problem A crafted `RESTORE` payload can store a `NAN` score in a listpack-encoded sorted set. The integrity validation (`lpValidateIntegrityAndDups`) only checks the listpack *structure* and member uniqueness — it does not check score validity — so the payload is accepted on load. When the sorted set is later converted to a skiplist (e.g. when it grows past `zset-max-listpack-entries`, or via any operation that triggers conversion), `zslInsertNode()` asserts the score is not `NAN` (`t_zset.c:260`) and the server aborts. **Any client with `RESTORE` access can remotely crash the server.** The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no equivalent check. ## Reproduction ``` RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...." # loads OK, then: ZADD k 9 x # forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT ``` ## Fix Add `zzlValidateScores()`, which scans the scores of a listpack zset after structural validation and rejects the payload if any score is `NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and large finite scores remain accepted (only `NAN` is rejected), matching normal `ZADD` semantics. ## Testing - `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting rejection and that the server stays up. - Verified the test **fails on the pre-fix code** (server crashes on conversion) and **passes after the fix**, by stashing the fix during the run. - Confirmed valid zsets, including `inf`/`-inf`/large finite scores, still load and convert correctly. - Full `integration/corrupt-dump` suite: 74 passed, 0 failed. > [!NOTE] > Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The Case 3 portion of the test was flaky: after a single round of
`CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state
and `wait_for_cluster_state fail` would time out with
`Cluster node 1 cluster_state:ok`.
The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:
```
if (isSlotUnclaimed(j) ||
server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
clusterSlotFailoverGranted(j)) {
...
clusterDelSlot(j);
clusterAddSlot(sender, j);
...
}
```
R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:
```
if (server.cluster->slots[j] == sender) {
/* The slot is currently bound to the sender but the sender is no longer
* claiming it. We don't want to unbind the slot yet as it can cause the cluster
* to move to FAIL state and also throw client error. Keeping the slot bound to
* the previous owner will cause a few client side redirects, but won't throw
* any errors. We will keep track of the uncertainty in ownership to avoid
* propagating misinformation about this slot's ownership using UPDATE
* messages. */
bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
}
```
Combined with clusterUpdateState's full-coverage check looking only
at slots[j] == NULL, R1 stays at cluster OK forever.
```
if (server.cluster->slots[j] == NULL || ...) {
new_state = CLUSTER_FAIL;
...
}
```
Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.
This closes #3891. The test was added in #3674.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Summary `addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)` when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`) regardless of the `use_map` parameter. The non-empty path (below it) already branches correctly on `use_map` — the empty early-return was simply missing the same logic. In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`) instead of an Array (`*0`) for any command without subcommands (e.g. PING). Strict RESP3 client libraries that dispatch on collection type will misinterpret the response. Not visible in RESP2 since both Set and Array use the `*` prefix there. ## Fix Apply the same `use_map` branch to the empty case: - `addReplyMapLen(c, 0)` when `use_map=1` - `addReplyArrayLen(c, 0)` otherwise ## Test Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that inspects the raw wire-level type byte for the subcommands field of `COMMAND INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set). Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com> Signed-off-by: rickrams <rickrams@amazon.com>
`off_t` (64-bit), but were read into `int` (32-bit) locals in `genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`. This causes INFO replication to report negative `master_sync_total_bytes` during bio disk-based sync when RDB exceeds 2GB. Fix: change the local variable types from `int` to `off_t`. Signed-off-by: chx9 <lovelypiska@outlook.com>
255 is too short if valkey-server is being run from a long path, especially if many fields are included such as in the "Process title set as expected" test in `unit/other.tcl`, where the max length for the cmdline is 1024, so having the same number in both makes sense. Closes #3832. Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>
The `test io-threads are runtime modifiable` test in `tests/unit/other.tcl` times out on the dedicated Valgrind jobs of the daily CI, failing the run. The failing test is introduced in #3938. This PR reduces the loop to 10 iterations under Valgrind. **Failure links:** - https://github.com/valkey-io/valkey/actions/runs/27386948127 (Jun 12) - https://github.com/valkey-io/valkey/actions/runs/27315974006 (Jun 11) - https://github.com/valkey-io/valkey/actions/runs/27245311034 (Jun 10) --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Add explicit validation for the FIELDS token before parsing numfields. Fixes #4045. Signed-off-by: lcxn123 <156999965+lcxn123@users.noreply.github.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
Some functions are newly deprecated in OpenSSL 4.0. This commit works around those. More specifically: - X509_cmp_current_time() in isCertValid() is no longer necessary as X509_check_certificate_times() compares the notBefore and notAfter on its own. Had to add a version check since this function is new in OpenSSL 4.0. - Replace deprecated X509_NAME_get_text_by_NID. Not a perfect fix, because the new implementation still assumes that the name does not contain embedded null characters which may not be true, e.g., if the name is of type UniversalString or BMPString. - Also fix constness of X509_get_subject_name return value. The latter two fixes are taken from this Fedora patch: https://src.fedoraproject.org/rpms/valkey/c/1a9c8847172ef3fb116a1e2fdb3871692378adae?branch=rawhide Closes #4012 Signed-off-by: Petr Khartskhaev <pkhartsk@redhat.com>
AI conflict resolution: source PR #4016Fix build warnings with OpenSSL 4.0 1 conflicted file Claude Summary
AI-resolved conflicted files
Full backport commit diff: commit d9c38d207f09.
|
#4047) ### Problem `HRANDFIELD` with a positive `count` can enter an infinite loop (CASE 4) when the number of non-expired fields in a hash is less than `count`. **Example:** A hash has 100 fields total: 71 with short TTLs (already expired) and 29 persistent fields. A client runs: ``` HRANDFIELD myhash 30 ``` Since `count * 3 = 90 < 100 = size`, CASE 4 is selected. The loop calls `hashTypeRandomElement()`, which skips expired fields (via `validateEntry`), so it always returns one of the 29 valid fields. `hashTypeRandomElement()` retries up to 100 times internally to find a non-expired field. However, as long as any valid fields exist, this retry limit is rarely exhausted. After collecting all 29 unique valid fields, every subsequent `hashtableAdd()` returns false (duplicate), but `added = 29 < count = 30`, so the loop never breaks. This is reproducible when active expiry hasn't cleaned up the expired fields yet (e.g. on a replica, or during a burst of expirations). ### Fix Add a `maxiter = count * 10` limit to the CASE 4 loop. When the iteration budget is exhausted, return whatever unique fields have been collected so far — returning fewer results is acceptable. Signed-off-by: cjx-zar <jxchenczar@foxmail.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
moduleNotifyKeyspaceEvent() runs subscriber callbacks on the executing client and calls selectDb(dbid) on it, but never restores the DB. When dbid differs from the client's current DB (e.g. MOVE/COPY notify on the destination DB), the client is left on the wrong DB, breaking subsequent commands. This was introduced by #1819, which started reusing the executing client instead of a temporary one. Save the executing client's DB before the subscriber loop and restore it afterwards. Covers both MOVE and COPY. Signed-off-by: Binbin <binloveplay1314@qq.com>
When loading a stream consumer group from RDB, the loader assigns each NACK's consumer pointer without checking if it was already set. A corrupt RDB payload can list the same message ID under multiple consumers' PELs, resulting in a shared NACK. This leads to use-after-free when one consumer is deleted while another still references the same NACK. Add a check that nack->consumer is NULL before assigning. If already set, reject the RDB as corrupt. Signed-off-by: Saurabh Kher <saurabh@amazon.com>
…s disconnected (#4068) Prevent a segfault that can happen during a sentinel coordinated failover (a new feature introduced in #1292) that can happen when the coordinated failover attempting to clean up the former leader races with that leader disconnecting. Add a check for `link->cc` in `sentinelKillClients`. Otherwise the sentinel node will segfault in this case. Fixes #4066. Signed-off-by: Luke Palmer <luke@lukepalmer.net> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
Backport sweep for 9.1
Automated cherry-picks from PRs marked "To be backported".
Applied
AI resolution details are posted as comments on this PR when available.
Generated by valkey-ci-agent using Claude Code.