Reduce io-threads modifiability test iterations under Valgrind by sarthakaggarwal97 · Pull Request #3980 · valkey-io/valkey

sarthakaggarwal97 · 2026-06-12T03:40:28Z

The `io-threads are runtime modifiable` test in tests/unit/other.tcl times out on the dedicated Valgrind jobs of the daily CI, failing the run. The failing test was introduced in #3938.

This PR reduces the loop to 10 iterations under Valgrind.

Failure links:

The 'test io-threads are runtime modifiable' test in unit/other toggles io-threads 100 times. Each toggle tears down and respawns real pthreads (pthread_create on grow, pthread_cancel + pthread_join on shrink) plus a drainIOThreadsQueue() pass, which is far slower under Valgrind. With the IO-threads redesign in place, the full run no longer fits in the test timeout on the dedicated Valgrind jobs.

The same timeout appeared while the redesign was originally on unstable, went away when it was reverted, and returned once it was relanded.

Reduce the loop to 10 iterations under Valgrind while keeping the full 100 in normal runs, preserving memcheck coverage of the thread spawn/teardown path.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
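To make the per-toggle cost concrete, here is a minimal standalone sketch of the grow/shrink pattern the commit message describes: pthread_create on grow, pthread_cancel followed by pthread_join on shrink. It is not Valkey's implementation. The names `worker_main`, `resize_workers`, `toggle_workers`, and `MAX_WORKERS` are hypothetical, and the queue-draining pass is left out.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_WORKERS 8 /* hypothetical cap, not a Valkey constant */

static pthread_t workers[MAX_WORKERS];
static int active_workers = 0;

/* Worker body: idles until cancelled. sleep() is a cancellation point. */
static void *worker_main(void *arg) {
    (void)arg;
    for (;;) sleep(1);
    return NULL;
}

/* Grow or shrink the pool to `target` threads. Returns 0 on success, -1 on error. */
static int resize_workers(int target) {
    if (target < 0 || target > MAX_WORKERS) return -1;
    while (active_workers < target) { /* grow: spawn real threads */
        if (pthread_create(&workers[active_workers], NULL, worker_main, NULL) != 0) return -1;
        active_workers++;
    }
    while (active_workers > target) { /* shrink: cancel, then wait for exit */
        pthread_t t = workers[active_workers - 1];
        if (pthread_cancel(t) != 0) return -1;
        if (pthread_join(t, NULL) != 0) return -1;
        active_workers--;
    }
    return 0;
}

/* Resizes to a random size in 1..5 on each iteration, like the test loop.
 * Returns the number of resizes that completed. */
static int toggle_workers(int iterations) {
    int done = 0;
    for (int i = 0; i < iterations; i++) {
        int target = rand() % 5 + 1;
        if (resize_workers(target) != 0) break;
        done++;
    }
    return done;
}
```

Calling `toggle_workers(100)` natively and then under Valgrind is one rough way to compare the thread create/cancel/join cost in isolation from the server.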

coderabbitai · 2026-06-12T03:40:43Z

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 62b7ff3 and d593da9.

📒 Files selected for processing (2)

tests/support/server.tcl
tests/test_helper.tcl

📝 Walkthrough

Test server spawn notifications are restructured to pass log file paths, enabling test_helper to track and dump server logs on timeout for diagnostics. Additionally, the io-threads runtime modifiability test uses Valgrind-aware iteration count to reduce execution time under Valgrind.

Changes

Server Log Diagnostics on Timeout

| Layer / File(s) | Summary |
| --- | --- |
| Structured server-spawned payload and handler update (`tests/support/server.tcl`, `tests/test_helper.tcl`) | `spawn_server` sends server-spawned as a list containing PID, stdout log path, and curfile instead of a single formatted string. The test_helper message handler parses this list to extract and record the log path. |
| Log path tracking and timeout diagnostics (`tests/test_helper.tcl`) | New global array `::server_logs` maps server PIDs to log file paths. New `dump_server_logs` procedure iterates active servers and prints the last 100 lines of each log for diagnostics. Timeout handler calls `dump_server_logs` before terminating servers. |

IO-threads Test Performance Optimization

| Layer / File(s) | Summary |
| --- | --- |
| Valgrind-aware iteration count for io-threads test (`tests/unit/other.tcl`) | The io-threads runtime modifiability test now uses a conditional iteration count: 10 iterations under Valgrind, 100 otherwise, replacing the hardcoded loop count. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

enjoy-binbin

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: reducing test iterations under Valgrind to address timeout issues. |
| Description check | ✅ Passed | The description clearly explains the problem (test timeout under Valgrind), the solution (reducing iterations to 10), and provides evidence with three CI failure links. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov · 2026-06-12T04:04:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.91%. Comparing base (436dcae) to head (bb20d24).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3980      +/-   ##
============================================
+ Coverage     76.75%   76.91%   +0.15%     
============================================
  Files           162      162              
  Lines         81017    81017              
============================================
+ Hits          62187    62311     +124     
+ Misses        18830    18706     -124

see 25 files with indirect coverage changes

zuiderkwast

Nice! Did you run daily on your branch to check that it fixes it?

enjoy-binbin · 2026-06-12T08:47:43Z

-        for {set i 0} {$i < 100} {incr i} {
+        for {set i 0} {$i < $iterations} {incr i} {
            set random_num [expr {int(rand() * 5) + 1}]
            r config set io-threads $random_num


Why would 100 config set calls time out? I can see the cluster is a fresh one and the test doesn't do anything else at all. Is it really that slow? I am worried that there may be a bug in the code.

And by the way, a timed-out test doesn't print the server logs, which makes troubleshooting difficult.

Should we make the test framework print the server logs on timeout?

Yeah, I'm far more worried that there is a bug, because we recently made a change. The timeout message is related to the infra not being able to get a response. Since we aren't seeing this more widely, I think something might actually be getting wedged.

Apologies for the delay, I was occupied with some other things. I tried running more tests for this particular test.

First, I ran the old 100-iteration behavior with this PR with engine logs and the normal Valgrind timeout. Both Valgrind jobs timed out: https://github.com/sarthakaggarwal97/valkey/actions/runs/27724594959

The useful part is that the new timeout logging now prints the server log before killing the server. The logs show the server was still making slow progress through CONFIG SET io-threads over the timeout window, with repeated:

Changing number of IO threads...

IO thread(...) terminated

I also ran the same restored 100-iteration test without Valgrind on GitHub Actions: https://github.com/sarthakaggarwal97/valkey/actions/runs/27726579423

That passed. The io-thread modifiability test took only 32 ms, and unit/other completed in 30 seconds: https://github.com/sarthakaggarwal97/valkey/actions/runs/27726579423

Finally, the current PR code, which keeps 100 iterations for normal runs but reduces Valgrind to 10 iterations, passes the focused Valgrind run: https://github.com/sarthakaggarwal97/valkey/actions/runs/27577339753

The issue appears to be general Valgrind slowness, but we can run more tests to verify if anyone has suggestions.

I suspect that we starved the main thread while running under Valgrind.

    /* Ignition Policy */
    if (server.active_io_threads_num == 1) {
        int should_ignite = 0;
        float main_thread_active_time = (float)getInstantaneousMetric(STATS_METRIC_MAIN_THREAD_ACTIVE_TIME) / 10000.0;
        /* Ignite IO threads when main-thread active time exceeds the threshold (30%) */
        should_ignite = (main_thread_active_time > (float)IO_IGNITION_MAIN_THREAD_ACTIVE_PERCENT);
        if (should_ignite) {
            pthread_mutex_unlock(&io_threads_mutex[1]);
            server.active_io_threads_num++;
            last_scale_time = now;
            serverLog(LL_DEBUG, "IO threads ignition: increased to %d", server.active_io_threads_num);
        }
        return;
    }
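The threshold check in the quoted snippet can be isolated as a small pure function. This sketch assumes, as the division in the snippet suggests, that dividing the raw metric by 10000 yields a percentage. `IGNITION_ACTIVE_PERCENT` and `should_ignite_for` are hypothetical names, with the value 30 taken from the snippet's comment.

```c
/* Hypothetical stand-in for IO_IGNITION_MAIN_THREAD_ACTIVE_PERCENT (30% per the quoted comment). */
#define IGNITION_ACTIVE_PERCENT 30

/* Returns 1 when the scaled main-thread active time is strictly above the
 * threshold, 0 otherwise. raw_metric uses the same scale as the quoted
 * snippet, assumed here to be percent * 10000. */
static int should_ignite_for(long raw_metric) {
    float main_thread_active_time = (float)raw_metric / 10000.0f;
    return main_thread_active_time > (float)IGNITION_ACTIVE_PERCENT;
}
```

Because the comparison is strict, a value of exactly 30% does not ignite under this reading. With only one active IO thread, nothing changes until the measured main-thread activity rises above the threshold.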

When a test times out, the harness killed the server without ever printing its log, making timeouts hard to troubleshoot. The orchestrator only tracked server pids, not their log paths.

Include the server's stdout log path in the server-spawned message and track it per-pid in the orchestrator. On timeout, dump the tail of each still-running server's log before the servers are killed.

Addresses review feedback on valkey-io#3980.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
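The harness change itself lives in the Tcl test framework. As a language-neutral illustration of the "dump the tail of a log" step, here is a minimal C sketch that writes the last N lines of a file to a stream. `count_lines` and `dump_log_tail` are hypothetical names, and the two-pass approach is a simplification chosen for clarity, not the PR's `dump_server_logs` implementation.

```c
#include <stdio.h>

/* Counts lines in an open file; a final line without a trailing newline counts too. */
static long count_lines(FILE *fp) {
    long lines = 0;
    int c, last = '\n';
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') lines++;
        last = c;
    }
    if (last != '\n') lines++;
    return lines;
}

/* Writes the last `max_lines` lines of `path` to `out`.
 * Returns the number of lines written, or -1 if the file cannot be opened. */
static long dump_log_tail(const char *path, long max_lines, FILE *out) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL) return -1;

    long total = count_lines(fp);
    long skip = total > max_lines ? total - max_lines : 0;

    /* Second pass: skip the leading lines, then copy the rest verbatim. */
    rewind(fp);
    long skipped = 0;
    int c;
    while (skipped < skip && (c = fgetc(fp)) != EOF) {
        if (c == '\n') skipped++;
    }
    while ((c = fgetc(fp)) != EOF) fputc(c, out);

    fclose(fp);
    return total - skip;
}
```

A timeout handler following the commit message would call something like `dump_log_tail(path, 100, stdout)` for each tracked pid before killing the servers. The order matters, since the log is the only evidence of what the server was doing.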

github-actions Bot assigned sarthakaggarwal97 Jun 12, 2026

sarthakaggarwal97 added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 12, 2026

sarthakaggarwal97 requested a review from madolson June 12, 2026 04:35

zuiderkwast approved these changes Jun 12, 2026

enjoy-binbin reviewed Jun 12, 2026

sarthakaggarwal97 added 2 commits June 15, 2026 14:54

Merge branch 'unstable' into deflake-io-threads-valgrind-timeout

bb20d24

sarthakaggarwal97 wants to merge 3 commits into valkey-io:unstable from sarthakaggarwal97:deflake-io-threads-valgrind-timeout

zuiderkwast Jun 12, 2026

madolson Jun 15, 2026

sarthakaggarwal97 Jun 18, 2026

enjoy-binbin Jun 18, 2026

4 participants
