Test framework: avoid stale tmpdir reuse on PID collision#3969

Draft
zuiderkwast wants to merge 1 commit into
valkey-io:unstablefrom
zuiderkwast:fix-test-dir-collision
@zuiderkwast

Contributor

If the OS reuses a PID, the tmpdir name can collide with a leftover directory from a previous test run, causing servers to load stale config (e.g. nodes.conf). Fix by skipping existing directories in tmpdir allocation.

Also remove nodes.conf in clean_persistence. It was previously left behind when the other files were removed after a start_server block. Cleaning is skipped in runs with --dont-clean, though, so both fixes are useful.


This problem was seen in this CI job, where a fresh server loaded a stale nodes.conf which confused the cluster and caused the test to fail: https://github.com/valkey-io/valkey/actions/runs/27279021315/job/80567998236?pr=3961#step:6:7995
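
As a rough illustration of the tmpdir fix described above, here is a minimal sketch of an allocator that skips names already present on disk. It is not the code from this PR: `sketch_tmpdir`, `::sketch_tmproot`, and `::sketch_counter` are hypothetical names, and the root directory is an arbitrary location under the current working directory.

```tcl
# Hypothetical stand-ins for the framework's temp root and counter.
set ::sketch_tmproot [file join [pwd] sketch-tmp]
set ::sketch_counter 0
file mkdir $::sketch_tmproot

# Return a newly created directory named <basename>.<pid>.<counter>,
# skipping any name that already exists on disk.
proc sketch_tmpdir {basename} {
    while 1 {
        set dir [file join $::sketch_tmproot $basename.[pid].[incr ::sketch_counter]]
        if {![file exists $dir]} break
    }
    file mkdir $dir
    return $dir
}

# Simulate a leftover directory from an earlier run that had the same PID.
set stale [file join $::sketch_tmproot server.[pid].1]
file mkdir $stale

set fresh [sketch_tmpdir server]
puts "stale: $stale"
puts "fresh: $fresh"
```

In this sketch the allocator passes over the pre-existing `server.<pid>.1` and returns `server.<pid>.2`. The existence check and the `file mkdir` call are not atomic, so this only guards against leftovers from earlier runs, not against two live processes racing for the same name.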

If the OS reuses a PID, the tmpdir name can collide with a leftover
directory from a previous test run, causing servers to load stale
config (e.g. nodes.conf). Fix by skipping existing directories in
tmpdir allocation.

Also clean nodes.conf in clean_persistence as an additional measure; it
was missing when removing files after a start_server block.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 917c3fa6-0abe-4098-a38a-9cba39b7d64e

📥 Commits

Reviewing files that changed from the base of the PR and between f3bdf50 and 3cd139b.

📒 Files selected for processing (2)
  • tests/support/server.tcl
  • tests/support/tmpfile.tcl

📝 Walkthrough

Two test infrastructure functions are updated: persistence cleanup now removes cluster configuration files, and temporary directory generation adds collision detection to avoid reusing existing paths.

Changes

Test Support Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Cluster node configuration cleanup (tests/support/server.tcl) | The clean_persistence function now deletes nodes.conf from the data directory during cleanup, alongside existing AOF/RDB data removal, with errors ignored if the file does not exist. |
| Temporary directory collision detection (tests/support/tmpfile.tcl) | The tmpdir function now loops to generate directory paths until finding an unused path, preventing collisions by checking for existing directories. |
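
The cleanup half of the change can be pictured with a small sketch like the one below. This is an assumption-laden illustration rather than the real clean_persistence: `sketch_clean_persistence` is a hypothetical helper, and `dump.rdb` and `appendonlydir` are assumed to be the default RDB and AOF names; only `nodes.conf` is named in this PR.

```tcl
# Hypothetical helper: remove persistence files and the cluster config
# from a server's data directory. The RDB and AOF names are assumed defaults.
proc sketch_clean_persistence {dir} {
    foreach name {dump.rdb appendonlydir nodes.conf} {
        # catch swallows any error (for example a permission problem),
        # so a failed removal never aborts the test run.
        catch {file delete -force [file join $dir $name]}
    }
}

# Build a data directory containing a stale cluster config.
set dir [file join [pwd] sketch-datadir]
file mkdir $dir
set fd [open [file join $dir nodes.conf] w]
puts $fd "stale cluster state"
close $fd

sketch_clean_persistence $dir
puts [file exists [file join $dir nodes.conf]]
```

The final line prints 0, since the stale nodes.conf no longer exists after cleanup. Names that are absent from the directory are simply skipped.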

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: avoiding stale tmpdir reuse on PID collision, which directly aligns with the primary fix of skipping existing directories in tmpdir allocation. |
| Description check | ✅ Passed | The description is clearly related to the changeset, explaining the problem of OS PID reuse causing tmpdir collisions and detailing both fixes: skipping existing directories in tmpdir allocation and removing nodes.conf in clean_persistence. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@hpatro hpatro left a comment

Contributor


There seems to be no harm in looping to find a nonexistent directory. But how did you determine that the PID was reused? I see PID 36285 printed only once.

@codecov

codecov Bot commented Jun 11, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.70%. Comparing base (f3bdf50) to head (3cd139b).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3969      +/-   ##
============================================
+ Coverage     76.67%   76.70%   +0.02%     
============================================
  Files           162      162              
  Lines         80752    80752              
============================================
+ Hits          61917    61941      +24     
+ Misses        18835    18811      -24     

see 15 files with indirect coverage changes


@zuiderkwast

Contributor Author

> There seems to be no harm in looping to find a nonexistent directory. But how did you determine that the PID was reused? I see PID 36285 printed only once.

@hpatro Actually, no, I can't explain it. The PID is that of the test framework's client process (as in --clients). There are 16 of them by default, and their lifetime spans the whole runtest run. They are started like exec $tclsh [info script] {*}$::argv --client $clientport &. Each client runs one test suite at a time. When the same client starts the next test suite, the counter isn't reset, so it keeps incrementing and can't collide with previous tests (unless it wraps, but we don't have that many start_server calls). The directory names have the form tests/tmp/server.$PID.$counter.
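
To make the naming argument above concrete, here is a small model of one test client. It is only a sketch: `sketch_dirname` and `::model_counter` are hypothetical names, and the suite names and the count of three servers per suite are made up for illustration.

```tcl
# Hypothetical model of one test client: the counter lives for the whole
# process and is never reset between suites.
set ::model_counter 0
proc sketch_dirname {} {
    return "tests/tmp/server.[pid].[incr ::model_counter]"
}

set seen [dict create]
foreach suite {suite-a suite-b} {
    # Each suite starts a few servers; the names keep counting upward.
    for {set i 0} {$i < 3} {incr i} {
        set name [sketch_dirname]
        dict set seen $name $suite
        puts "$suite -> $name"
    }
}
puts "distinct names: [dict size $seen]"
```

The last line prints `distinct names: 6`: within one process every name is unique. In this model a clash would require a different process with the same PID whose counter starts over, together with directories surviving from the earlier process.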

It could maybe happen if there are multiple runtest invocations in the same CI job, with both --dont-clean and --dont-pre-clean. We never run with --dont-pre-clean, though.

I'm no longer so sure about merging this PR. I'll leave it open for some time to see if we ever see this kind of failure again...

@zuiderkwast zuiderkwast marked this pull request as draft June 29, 2026 00:34

3 participants