Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Replaces the DIANN-to-mzTab tool with a DIANN-to-MSstats tool, removes the old diann2mztab module, tightens Bruker .d and mzML processing null-safety, makes PSM engine matching and MS2 handling more robust, refactors sample extraction row assembly, adds CI steps to test Bruker .d files, expands unit tests, and bumps the package version and sdrf-pipelines requirement.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as "GitHub Actions"
    participant PRIDE as "PRIDE storage"
    participant Runner as "CI runner"
    participant Tool as "quantmsutilsc (mzmlstats)"
    participant Code as "quantmsutils.mzml_statistics"
    participant DB as "Bruker tdf DB / files"
    CI->>PRIDE: download Bruker .d archive
    PRIDE-->>Runner: deliver archive
    Runner->>Runner: extract -> folder with analysis.tdf
    Runner->>Tool: run quantmsutilsc mzmlstats on extracted .d
    Tool->>Code: invoke Bruker .d handler
    Code->>DB: open/read analysis.tdf (check exists)
    DB-->>Code: return acquisition metadata / scans
    Code-->>Tool: emit parquet / ms_info outputs
    Tool-->>CI: return step result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/python-package.yml:
- Around line 57-60: The "Download Bruker .d test file" step downloads and
extracts hMICAL1_coiPAnP-N2-200_3Murea-1Mthiourea-200mMtcep_14733.d.tar.gz
without verifying its integrity; change the step to fetch a known checksum (or
include an expected SHA256 value) and verify it before extraction (use sha256sum
--check or compare computed sha256 against the expected value) and fail the job
on mismatch so the tar -xzf command only runs after successful checksum
verification.
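The verify-before-extract idea above can be sketched in Python (the actual CI step would be shell in the workflow YAML); `verify_sha256` and the expected digest value are illustrative, not part of the repository:

```python
import hashlib


def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> None:
    """Raise ValueError if the file's SHA-256 digest does not match expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large .d archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex:
        raise ValueError(f"Checksum mismatch for {path}: {actual} != {expected_hex}")
```

In the workflow, the equivalent shell guard would run `sha256sum --check` and only invoke `tar -xzf` after it succeeds.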
In `@quantmsutils/diann/diann2mztab.py`:
- Around line 314-315: The check combining existence and directory-ness should
be split so callers can distinguish "not found" from "not a directory": replace
the compound test on self.base_path in the class/method that contains this check
with two checks — first if not self.base_path.exists() raise
FileNotFoundError(f"Path {self.base_path} does not exist"), then elif not
self.base_path.is_dir() raise NotADirectoryError(f"Path {self.base_path} is not
a directory"); reference self.base_path and the surrounding method
(diann2mztab.py) when making the change and remove the combined-message form.
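A minimal sketch of the split check the prompt describes (the function name `check_base_path` is hypothetical; the real code lives in a method on the class holding `self.base_path`):

```python
from pathlib import Path


def check_base_path(base_path: Path) -> None:
    """Distinguish a missing path from an existing non-directory path."""
    if not base_path.exists():
        raise FileNotFoundError(f"Path {base_path} does not exist")
    if not base_path.is_dir():
        raise NotADirectoryError(f"Path {base_path} is not a directory")
```

Callers can then catch `FileNotFoundError` and `NotADirectoryError` separately instead of parsing a combined message.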
In `@quantmsutils/sdrf/extract_sample.py`:
- Around line 40-48: The DataFrame creation in extract_sample.py (variable
sample_dt) can yield a headerless CSV when rows is empty; ensure the output
schema is preserved by creating the DataFrame with explicit columns (e.g.,
supply columns=["Spectra_Filepath", "Sample"] or reindex after creation) when
building sample_dt from rows (the loop over f_table.iterrows uses s_data_frame
to lookup MSstats_Mixture and appends {"Spectra_Filepath","Sample"}), so that
even if rows remains empty the resulting DataFrame has the expected two columns.
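The explicit-columns idea can be sketched as follows, assuming pandas; `build_sample_dt` is a hypothetical helper standing in for the row-assembly loop in `extract_sample.py`:

```python
import pandas as pd


def build_sample_dt(rows: list[dict]) -> pd.DataFrame:
    # Passing columns= explicitly guarantees the output schema (and thus the
    # CSV header) even when rows is an empty list.
    return pd.DataFrame(rows, columns=["Spectra_Filepath", "Sample"])
```

Without `columns=`, `pd.DataFrame([])` has zero columns and `to_csv` would emit a headerless file.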
In `@tests/test_commands.py`:
- Around line 157-168: The failing test test_convert_psm_without_ms2 shows
psmconvert crashes when no ms2_file is provided; update the psmconvert CLI
handler (the code path exercised by run_cli_command("psmconvert", ...)) to
defensively handle a missing/None ms2_file: check for None before dereferencing
(any code that accesses ms2_file.* or opens it), skip MS2-dependent processing
steps or use an empty/default value, and surface a graceful exit (return success
when MS2 is optional) or a clear error message instead of raising; locate the
ms2 handling logic in the psm conversion function (the CLI entrypoint and
functions that parse/open ms2 files) and add the None guard and appropriate
fallback behavior so the test succeeds.
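A hedged sketch of the None guard (the real MS2 parsing in `psm_conversion.py` is more involved; `load_ms2_spectra` and its line-based parsing are placeholders):

```python
def load_ms2_spectra(ms2_file):
    """Return a spectrum lookup, or an empty dict when no MS2 file is given."""
    if ms2_file is None:
        # MS2 is optional: skip MS2-dependent processing instead of crashing.
        return {}
    # Placeholder parse; the real implementation reads spectra, not raw lines.
    with open(ms2_file) as fh:
        return {i: line.strip() for i, line in enumerate(fh)}
```

Downstream code then treats an empty lookup as "no MS2 enrichment" and still produces the PSM output.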
- Around line 151-156: Replace the conditional check that skips validation when
the output file is missing by asserting the file was created first (assert
output_file.exists()) and then proceed to read it into a DataFrame (df =
pd.read_parquet(output_file)) and assert its contents (len(df) > 0 and required
columns "sequence" and "scan_number" exist); update the test around the
output_file variable so a missing file fails the test instead of silently
passing.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `.DS_Store` is excluded by `!**/.DS_Store`
- `quantmsutils/.DS_Store` is excluded by `!**/.DS_Store`
- `tests/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (7)
- `.github/workflows/python-package.yml`
- `1.mzML`
- `quantmsutils/diann/diann2mztab.py`
- `quantmsutils/mzml/mzml_statistics.py`
- `quantmsutils/psm/psm_conversion.py`
- `quantmsutils/sdrf/extract_sample.py`
- `tests/test_commands.py`
🧹 Nitpick comments (1)
recipe/meta.yaml (1)
26-27: Minor: Inconsistent version-specifier spacing across `run` dependencies.

`sdrf-pipelines >=0.0.33,<0.1.0` now has a space before `>=`, while `pyopenms>=3.3.0` (Line 27) and `pyarrow>=16.1.0` (Line 29) do not. While conda accepts both forms, consistent formatting across all entries improves readability.

🔧 Proposed fix (pick one style and apply uniformly)

```diff
-  - sdrf-pipelines >=0.0.33,<0.1.0
-  - pyopenms>=3.3.0
+  - sdrf-pipelines>=0.0.33,<0.1.0
+  - pyopenms>=3.3.0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipe/meta.yaml` around lines 26 - 27, The run dependencies have inconsistent spacing around version specifiers: change the entry for sdrf-pipelines from "sdrf-pipelines >=0.0.33,<0.1.0" to match the others (e.g., "sdrf-pipelines>=0.0.33,<0.1.0") so all version specifiers (sdrf-pipelines, pyopenms, pyarrow) use the same no-space style; update the string in recipe/meta.yaml where the run: dependencies list is defined (look for the sdrf-pipelines, pyopenms, pyarrow entries) to apply this formatting uniformly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@recipe/meta.yaml`:
- Around line 26-27: The run dependencies have inconsistent spacing around
version specifiers: change the entry for sdrf-pipelines from "sdrf-pipelines
>=0.0.33,<0.1.0" to match the others (e.g., "sdrf-pipelines>=0.0.33,<0.1.0") so
all version specifiers (sdrf-pipelines, pyopenms, pyarrow) use the same no-space
style; update the string in recipe/meta.yaml where the run: dependencies list is
defined (look for the sdrf-pipelines, pyopenms, pyarrow entries) to apply this
formatting uniformly.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
37-37: ⚠️ Potential issue | 🟡 Minor — Fix pre-existing typos in documentation.
Three typos in the documentation body:
- Line 37: `gow` → `how`
- Line 59: `pyopnems` → `pyopenms`
- Line 73: `intesities` → `intensities`

✏️ Proposed fixes

```diff
-quantms-utils have multiple scripts to generate mzML stats. These files are used by multiple tools and packages within quantms ecosystem for quality control, mzTab generation, etc. Here are some details about the formats, the fields they contain and gow they are computed.
+quantms-utils have multiple scripts to generate mzML stats. These files are used by multiple tools and packages within quantms ecosystem for quality control, mzTab generation, etc. Here are some details about the formats, the fields they contain and how they are computed.

-> **NOTE**: For all the precursor-related information, we are using the first precursor in the spectrum. The following columns `intensity` (if not annotated), `precursor_rt`, and `precursor_total_intensity` we use the following pyopnems code:
+> **NOTE**: For all the precursor-related information, we are using the first precursor in the spectrum. The following columns `intensity` (if not annotated), `precursor_rt`, and `precursor_total_intensity` we use the following pyopenms code:

-`mzmlstats` allows the user to produce a file containing all the MS2 spectra including the intesities and masses of every peak.
+`mzmlstats` allows the user to produce a file containing all the MS2 spectra including the intensities and masses of every peak.
```

Also applies to: 59-59, 73-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 37, Update the README.md text to correct the three typos: change "gow" to "how" in the sentence beginning "quantms-utils have multiple scripts..." (line containing "gow they are computed"), change "pyopnems" to "pyopenms" where it appears at line ~59, and change "intesities" to "intensities" at line ~73; ensure only the misspelled words are replaced and spacing/punctuation remains unchanged.
♻️ Duplicate comments (1)
quantmsutils/diann/diann2mztab.py (1)
288-289: Split the compound path check into distinct exception types — still unresolved from the previous review.

The combined condition maps two distinct failure modes onto a single `NotADirectoryError`, preventing callers from distinguishing a missing path from an existing non-directory path. `FileNotFoundError` is the idiomatic and consistent choice (see the sibling check on Line 292) when the path does not exist.

🛡️ Proposed fix

```diff
-        if not self.base_path.exists() or not self.base_path.is_dir():
-            raise NotADirectoryError(f"Path {self.base_path} does not exist or is not a directory")
+        if not self.base_path.exists():
+            raise FileNotFoundError(f"Path {self.base_path} does not exist")
+        if not self.base_path.is_dir():
+            raise NotADirectoryError(f"Path {self.base_path} is not a directory")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@quantmsutils/diann/diann2mztab.py` around lines 288 - 289, The compound check on self.base_path should be split so callers can distinguish a missing path from a non-directory: first test self.base_path.exists() and raise FileNotFoundError(f"Path {self.base_path} does not exist") if false, then test self.base_path.is_dir() and raise NotADirectoryError(f"Path {self.base_path} exists but is not a directory") if false; update the code around the existing check in diann2mztab.py (the block currently raising NotADirectoryError) to implement these two separate checks and messages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@README.md`:
- Line 37: Update the README.md text to correct the three typos: change "gow" to
"how" in the sentence beginning "quantms-utils have multiple scripts..." (line
containing "gow they are computed"), change "pyopnems" to "pyopenms" where it
appears at line ~59, and change "intesities" to "intensities" at line ~73;
ensure only the misspelled words are replaced and spacing/punctuation remains
unchanged.
---
Duplicate comments:
In `@quantmsutils/diann/diann2mztab.py`:
- Around line 288-289: The compound check on self.base_path should be split so
callers can distinguish a missing path from a non-directory: first test
self.base_path.exists() and raise FileNotFoundError(f"Path {self.base_path} does
not exist") if false, then test self.base_path.is_dir() and raise
NotADirectoryError(f"Path {self.base_path} exists but is not a directory") if
false; update the code around the existing check in diann2mztab.py (the block
currently raising NotADirectoryError) to implement these two separate checks and
messages.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `README.md`
- `quantmsutils/diann/diann2mztab.py`
- `quantmsutils/psm/psm_conversion.py`
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/psm/psm_conversion.py
Actionable comments posted: 5
♻️ Duplicate comments (1)
tests/test_commands.py (1)
151-156: Output file validation is still conditional and can mask failures.

This was already flagged before: if the file is not created, the test can still pass by skipping assertions.

💡 Proposed fix

```diff
-        if output_file.exists():
-            df = pd.read_parquet(output_file)
-            assert len(df) > 0, "PSM output parquet is empty"
-            assert "sequence" in df.columns
-            assert "scan_number" in df.columns
+        assert output_file.exists(), "PSM output parquet was not created"
+        df = pd.read_parquet(output_file)
+        assert len(df) > 0, "PSM output parquet is empty"
+        assert "sequence" in df.columns
+        assert "scan_number" in df.columns
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_commands.py` around lines 151 - 156, The test currently guards the parquet checks behind "if output_file.exists()", which hides missing-file failures; replace that conditional with an explicit assertion on the file's existence (assert output_file.exists()) and then unconditionally read the file with pd.read_parquet(output_file) into df and run the existing assertions (len(df) > 0, "sequence" in df.columns, "scan_number" in df.columns) so a missing output file or empty/invalid contents fails the test; update the block referencing output_file and df accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Line 37: The CLI option --qvalue_threshold (declared with
`@click.option`("--qvalue_threshold", "-q", type=float)) must be made required or
given a safe default so the later comparison report["Q.Value"] <
qvalue_threshold in diann2msstats.py (used around the block referencing
report["Q.Value"]) cannot fail; update the `@click.option` for qvalue_threshold to
either required=True or add a sensible default (e.g., default=0.01) and ensure
the variable name qvalue_threshold remains unchanged so downstream checks use a
guaranteed float value.
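A minimal sketch of the suggested option declaration, assuming click and the proposed `default=0.01` (the real command body in `diann2msstats.py` is omitted):

```python
import click


@click.command()
@click.option(
    "--qvalue_threshold", "-q",
    type=float,
    default=0.01,  # safe default; alternatively declare required=True
    show_default=True,
    help="Q.Value cutoff applied to the DIA-NN report.",
)
def diann2msstats(qvalue_threshold: float) -> None:
    # qvalue_threshold is now always a float, so a later
    # report["Q.Value"] < qvalue_threshold comparison cannot see None.
    click.echo(f"filtering at q-value < {qvalue_threshold}")
```

With no default and no `required=True`, click passes `None`, and the comparison against a float raises a `TypeError` at runtime.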
- Around line 188-193: The report() function currently always prefers
"report.tsv" then "report.parquet", which can mismatch later parser selection
(lines that call read_csv/read_parquet around the parser selection block), so
change report() to return the actual report path along with its extension or
otherwise detect the existing file type and use that to choose the parser;
specifically, use find_first_file_with_suffix("report.tsv") and
find_first_file_with_suffix("report.parquet") to check which file exists and
return the one that exists (or raise a clear error if both exist), then update
the parser-selection logic (the code that decides between read_csv and
read_parquet, referencing read_csv, read_parquet, and the DIA-NN version check)
to pick the reader based on the returned file's extension rather than assuming a
version mapping.
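One way to sketch the suggested detection, using pathlib instead of the module's `find_first_file_with_suffix` helper (the names `find_report` and `pick_reader` are illustrative):

```python
from pathlib import Path


def find_report(base: Path) -> Path:
    """Return the DIA-NN report that actually exists, rather than assuming one."""
    candidates = [p for p in (base / "report.tsv", base / "report.parquet") if p.exists()]
    if not candidates:
        raise FileNotFoundError(f"No report.tsv or report.parquet under {base}")
    if len(candidates) > 1:
        raise ValueError(f"Ambiguous reports under {base}: {candidates}")
    return candidates[0]


def pick_reader(report_path: Path) -> str:
    # Choose the parser from the file's actual suffix, not a version mapping.
    return "read_parquet" if report_path.suffix == ".parquet" else "read_csv"
```

The returned path's extension then drives the `read_csv`/`read_parquet` choice, so the two can never disagree.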
- Around line 124-127: The helper _true_stem currently uses
os.path.basename(x).split(".")[0], which wrongly truncates filenames at the
first dot; replace that with os.path.splitext(os.path.basename(x))[0] so only
the final extension is removed (handles names like sample.v1.mzML correctly).
Apply the same fix to the other occurrence that uses split(".")[0] (search for
any other uses in this module) to ensure Run mapping and downstream joins keep
the full stem except for the final extension.
- Around line 103-118: The current merge of out_msstats with the derived
sample/run mapping (using s_data_frame and f_table) is an inner join on "Run" so
unmatched runs in out_msstats get dropped; change the merge call on
out_msstats.merge(...) to use how="left" (preserve all rows from out_msstats)
and keep validate="many_to_one", then immediately after the merge check the
merged columns (Condition or BioReplicate) for missing/NaN values and raise or
log a clear error including the offending Run IDs so mapping mismatches are
surfaced (refer to symbols: out_msstats, s_data_frame, f_table, merge, "Run",
"Condition", "BioReplicate").
In `@README.md`:
- Around line 22-23: The README line describing the diann2msstats command uses
plural "output formats" but the command now produces a single output pathway;
update the wording to singular to match the command scope by changing the phrase
"The output formats are used for quality control and downstream analysis in
quantms." to a singular form (e.g., "The output format is used for quality
control and downstream analysis in quantms.") so the entry for `diann2msstats`
reads consistently.
---
Duplicate comments:
In `@tests/test_commands.py`:
- Around line 151-156: The test currently guards the parquet checks behind "if
output_file.exists()", which hides missing-file failures; replace that
conditional with an explicit assertion on the file's existence (assert
output_file.exists()) and then unconditionally read the file with
pd.read_parquet(output_file) into df and run the existing assertions (len(df) >
0, "sequence" in df.columns, "scan_number" in df.columns) so a missing output
file or empty/invalid contents fails the test; update the block referencing
output_file and df accordingly.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `README.md`
- `quantmsutils/diann/diann2msstats.py`
- `quantmsutils/diann/diann2mztab.py`
- `quantmsutils/quantmsutilsc.py`
- `recipe/meta.yaml`
- `tests/test_commands.py`
💤 Files with no reviewable changes (1)
- quantmsutils/diann/diann2mztab.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Pull request overview
This PR replaces the complex multi-target diann2mztab command with a streamlined diann2msstats command focused solely on MSstats output conversion. It also includes critical bug fixes for Bruker .d file handling and modernizes deprecated pandas code. Version is bumped to 0.0.25 with updated dependencies including Python >=3.10 and setuptools <78 constraints.
Changes:
- Removed `diann2mztab` command and replaced with simplified `diann2msstats` command (breaking change)
- Fixed critical bugs in Bruker .d file processing (if/elif chain, operator precedence, missing file checks, null handling)
- Fixed PSM conversion to handle optional ms2_file parameter and use flexible search engine matching
- Replaced deprecated pandas DataFrame.append() with list accumulation pattern
- Added comprehensive tests for new functionality including Bruker .d workflow, PSM modification parsing, and sample extraction
- Updated CI to test Bruker .d files and Python dependency constraints
Reviewed changes
Copilot reviewed 11 out of 25 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| quantmsutils/diann/diann2msstats.py | New simplified DIA-NN to MSstats converter (151 lines) |
| quantmsutils/diann/diann2mztab.py | Entire file removed (1645 lines deleted) |
| quantmsutils/quantmsutilsc.py | Updated CLI import from diann2mztab to diann2msstats |
| quantmsutils/mzml/mzml_statistics.py | Fixed critical bugs: if/elif exclusivity, operator precedence for intensity check, null handling for acquisition date, missing file check |
| quantmsutils/psm/psm_conversion.py | Fixed None handling for optional ms2_file, flexible search engine matching, improved output path construction |
| quantmsutils/sdrf/extract_sample.py | Replaced deprecated DataFrame.append with list accumulation |
| tests/test_commands.py | Added tests for diann2msstats, PSM conversion without ms2, Bruker .d processing, mods_position function, and MSstats_Mixture handling |
| tests/test_data/diann2msstats/* | Added test data files (parquet, TSV, YAML) |
| recipe/meta.yaml | Version 0.0.25, added setuptools <78 constraint, updated Python to >=3.10,<3.13, updated sdrf-pipelines to >=0.0.33,<0.1.0 |
| pyproject.toml | Version bump to 0.0.25 |
| README.md | Updated documentation from diann2mztab to diann2msstats |
| .gitignore | Removed *.tsv exclusion, updated test data paths from diann2mztab to diann2msstats |
| .github/workflows/python-package.yml | Added Bruker .d test workflow step, fixed typo |
| .DS_Store, tests/.DS_Store, quantmsutils/.DS_Store | macOS system files (should not be committed) |
```diff
 readme = "README.md"
 license = "MIT"
-version = "0.0.24"
+version = "0.0.25"
```
The version in the PR title says 0.0.24, but the actual version changes in pyproject.toml and meta.yaml show 0.0.25. This discrepancy should be resolved - the title should match the actual version being released.
```yaml
  - python >=3.10,<3.13
  - click
  - sdrf-pipelines>=0.0.32
  - setuptools <78
```
The sdrf-pipelines dependency has been updated from >=0.0.32 to >=0.0.33,<0.1.0. This is a minor version bump with an upper bound added. Consider documenting what features or fixes in sdrf-pipelines 0.0.33 are required for this release, especially if there are any breaking changes that necessitate the upper bound constraint.
Suggested change:

```yaml
  - setuptools <78
  # sdrf-pipelines >=0.0.33 is required for features/fixes used by quantms-utils
  # (e.g. updated SDRF handling and CLI behavior relied on by quantmsutilsc).
  # The <0.1.0 upper bound is intentional because the 0.1.x series is expected
  # to introduce breaking changes to these APIs; loosen this bound once
  # quantms-utils has been verified against sdrf-pipelines >=0.1.0.
```
```diff
   - setuptools <78
 run:
-  - python
+  - python >=3.10,<3.13
   - click
-  - sdrf-pipelines>=0.0.32
+  - setuptools <78
+  - sdrf-pipelines >=0.0.33,<0.1.0
   - pyopenms>=3.3.0
   - pandas
   - pyarrow>=16.1.0
   - scipy
 test:
   requires:
     - setuptools <78
```
The setuptools version constraint <78 appears in three places (host, run, and test requirements). This is a significant constraint that should be documented. Consider adding a comment explaining why setuptools must be pinned below version 78, as this could help future maintainers understand the reason for this limitation.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary by CodeRabbit
Bug Fixes
New / Changed Features
Tests
Chores