Reduce io-threads modifiability test iterations under Valgrind#3980
Reduce io-threads modifiability test iterations under Valgrind#3980sarthakaggarwal97 wants to merge 3 commits into
Conversation
The 'test io-threads are runtime modifiable' test in unit/other toggles io-threads 100 times. Each toggle tears down and respawns real pthreads (pthread_create on grow, pthread_cancel + pthread_join on shrink) plus a drainIOThreadsQueue() pass, which is far slower under Valgrind. With the IO-threads redesign in place, the full run no longer fits in the test timeout on the dedicated Valgrind jobs. The same timeout appeared while the redesign was originally on unstable, went away when it was reverted, and returned once it was relanded. Reduce the loop to 10 iterations under Valgrind while keeping the full 100 in normal runs, preserving memcheck coverage of the thread spawn/teardown path. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughTest server spawn notifications are restructured to pass log file paths, enabling test_helper to track and dump server logs on timeout for diagnostics. Additionally, the io-threads runtime modifiability test uses Valgrind-aware iteration count to reduce execution time under Valgrind. ChangesServer Log Diagnostics on Timeout
IO-threads Test Performance Optimization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3980 +/- ##
============================================
+ Coverage 76.75% 76.91% +0.15%
============================================
Files 162 162
Lines 81017 81017
============================================
+ Hits 62187 62311 +124
+ Misses 18830 18706 -124 🚀 New features to boost your workflow:
|
zuiderkwast
left a comment
There was a problem hiding this comment.
Nice! Did you run daily on your branch to check that it fixes it?
| for {set i 0} {$i < 100} {incr i} { | ||
| for {set i 0} {$i < $iterations} {incr i} { | ||
| set random_num [expr {int(rand() * 5) + 1}] | ||
| r config set io-threads $random_num |
There was a problem hiding this comment.
Why 100 config set will timed out, i can see the cluster is a fresh one and the test doesn't do anything at all. Is it really that slow? I am worry that maybe a bug in the code.
And btw, the timeout test don't print the server logs makes troubleshooting difficult.
There was a problem hiding this comment.
Let's make the test framework print the server logs on timeout?
There was a problem hiding this comment.
Yeah, I'm far more worried there is a bug because we recently made a change. The timeout message is related to the infra not able to get a response. Since we aren't seeing this widespread, I think something might actually be getting wedged.
There was a problem hiding this comment.
Apologies for the delay, I was occupied with some other things. I tried running more tests for this particular test.
First, I ran the old 100-iteration behavior with this PR with engine logs and the normal Valgrind timeout. Both Valgrind jobs timed out: https://github.com/sarthakaggarwal97/valkey/actions/runs/27724594959
The useful part is that the new timeout logging now prints the server log before killing the server. The logs show the server was still making slow progress through CONFIG SET io-threads over the timeout window, with repeated:
Changing number of IO threads...IO thread(...) terminated
I also ran the same restored 100-iteration test without Valgrind on GitHub Actions: https://github.com/sarthakaggarwal97/valkey/actions/runs/27726579423
That passed. The io-thread modifiability test took only 32 ms, and unit/other completed in 30 seconds: https://github.com/sarthakaggarwal97/valkey/actions/runs/27726579423
Finally, the current PR code, which keeps 100 iterations for normal runs but reduces Valgrind to 10 iterations, passes the focused Valgrind run: https://github.com/sarthakaggarwal97/valkey/actions/runs/27577339753
The issue appears to be general to Valgrind slowness but we can do more tests to verify if someone has any suggestions.
There was a problem hiding this comment.
I suspect that we starved the main thread while running under Valgrind.
/* Ignition Policy */
if (server.active_io_threads_num == 1) {
int should_ignite = 0;
float main_thread_active_time = (float)getInstantaneousMetric(STATS_METRIC_MAIN_THREAD_ACTIVE_TIME) / 10000.0;
/* Ignite IO threads when main-thread active time exceeds the threshold (30%) */
should_ignite = (main_thread_active_time > (float)IO_IGNITION_MAIN_THREAD_ACTIVE_PERCENT);
if (should_ignite) {
pthread_mutex_unlock(&io_threads_mutex[1]);
server.active_io_threads_num++;
last_scale_time = now;
serverLog(LL_DEBUG, "IO threads ignition: increased to %d", server.active_io_threads_num);
}
return;
}
When a test times out, the harness killed the server without ever printing its log, making timeouts hard to troubleshoot. The orchestrator only tracked server pids, not their log paths. Include the server's stdout log path in the server-spawned message and track it per-pid in the orchestrator. On timeout, dump the tail of each still-running server's log before the servers are killed. Addresses review feedback on valkey-io#3980. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
The
test io-threads are runtime modifiabletest intests/unit/other.tcltimes out on the dedicated Valgrind jobs of the daily CI, failing the run. The failing test is introduced in #3938.This PR reduces the loop to 10 iterations under Valgrind.
Failure links: