Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can use OpenGrep to find security vulnerabilities and bugs across 17+ programming languages.OpenGrep is compatible with Semgrep configurations. Add an |
There was a problem hiding this comment.
Pull request overview
This PR updates the DIA-NN → MSstats converter to accept a newer “unified” (single-table) experimental design TSV in addition to the existing legacy two-table format (blank-line separated), by auto-detecting the format and routing to the appropriate parser.
Changes:
- Add format auto-detection in
get_exp_design_dfs()to choose between unified vs legacy parsing. - Introduce
_parse_unified_design()to translate the unified flat table into the legacy(s_data_frame, f_table)interface. - Refactor the legacy parsing logic into
_parse_legacy_design().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
quantmsutils/diann/diann2msstats.py
Outdated
| has_blank_line = "\n" in data | ||
| header_line = data[0].replace("\n", "") if data else "" | ||
| is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line | ||
|
|
||
| if is_unified_format and not has_blank_line: |
There was a problem hiding this comment.
Unified-format detection currently forces legacy parsing whenever any blank line exists in the file (has_blank_line = "\n" in data). This makes unified design parsing fragile (e.g., a trailing empty line will route to _parse_legacy_design and raise the legacy "blank separator row" error). Consider ignoring leading/trailing empty/whitespace-only lines for detection, or preferring unified parsing whenever the header matches unified columns regardless of extra blank lines.
| has_blank_line = "\n" in data | |
| header_line = data[0].replace("\n", "") if data else "" | |
| is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line | |
| if is_unified_format and not has_blank_line: | |
| header_line = data[0].replace("\n", "") if data else "" | |
| is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line | |
| # Prefer unified parsing whenever the header matches unified columns, | |
| # regardless of extra blank/whitespace-only lines elsewhere in the file. | |
| if is_unified_format: |
quantmsutils/diann/diann2msstats.py
Outdated
| # Build s_data_frame (sample table): unique samples with condition/bioreplicate | ||
| s_data_frame = ( | ||
| df[["Sample", "Condition", "BioReplicate"]] | ||
| .drop_duplicates(subset=["Sample"]) | ||
| .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"}) |
There was a problem hiding this comment.
_parse_unified_design drops duplicates on Sample only. If the input contains the same Sample with different Condition/BioReplicate across rows (e.g., due to a design mistake), this will silently pick the first row and produce incorrect mapping. It would be safer to assert that each Sample maps to exactly one (Condition, BioReplicate) pair and raise a clear ValueError when conflicts are present.
| # Build s_data_frame (sample table): unique samples with condition/bioreplicate | |
| s_data_frame = ( | |
| df[["Sample", "Condition", "BioReplicate"]] | |
| .drop_duplicates(subset=["Sample"]) | |
| .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"}) | |
| # Build s_data_frame (sample table): ensure each Sample maps to exactly one | |
| # (Condition, BioReplicate) pair, and raise if conflicting mappings are present. | |
| unique_mapping = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates() | |
| # Detect samples that are associated with multiple distinct (Condition, BioReplicate) pairs. | |
| duplicated_samples = unique_mapping["Sample"][unique_mapping["Sample"].duplicated(keep=False)].unique() | |
| if len(duplicated_samples) > 0: | |
| conflict_list = ", ".join(map(str, duplicated_samples)) | |
| raise ValueError( | |
| "Inconsistent experimental design in unified format: the following Sample(s) map to " | |
| "multiple (Condition, BioReplicate) combinations: " | |
| f"{conflict_list}. Each Sample must have exactly one Condition and one BioReplicate." | |
| ) | |
| s_data_frame = unique_mapping.rename( | |
| columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"} |
quantmsutils/diann/diann2msstats.py
Outdated
| if is_unified_format and not has_blank_line: | ||
| return _parse_unified_design(exp_design_file) |
There was a problem hiding this comment.
The new unified design parsing path (_parse_unified_design) isn’t covered by tests. The existing CLI test uses the legacy two-table design file, so regressions in unified parsing (column expectations, run derivation, merging) wouldn’t be caught. Please add a test fixture for a unified flat design TSV and a corresponding test that runs diann2msstats end-to-end.
| df = pd.read_csv(exp_design_file, sep="\t") | ||
| df["run"] = df["Filename"].apply(lambda x: _true_stem(x)) | ||
|
|
||
| # Build f_table (file table): one row per file/channel | ||
| f_table = df[["Filename", "Fraction", "Sample", "run"]].copy() | ||
|
|
There was a problem hiding this comment.
_parse_unified_design assumes required columns exist (Filename, Fraction, Sample, Condition, BioReplicate) and will raise a KeyError if any are missing/misspelled. Since this function is part of CLI input handling, consider validating the required columns up front and raising a ValueError with a helpful message listing missing columns and the expected header.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
quantmsutils/diann/diann2msstats.py (2)
119-128: Format detection doesn't validate all required columns.Line 123 checks for
Condition,BioReplicate, andFilename, but_parse_unified_design()also requiresFractionandSamplecolumns (lines 142, 146). If a file passes the unified format check but lacks these columns, aKeyErrorwill be raised with a less informative message.Additionally, the substring check (e.g.,
"Condition" in header_line) could match unintended column names like"MyCondition". Consider splitting the header and checking exact column names.♻️ Suggested improvement
# Auto-detect format: new unified flat table (from convert-diann) vs legacy two-table has_blank_line = "\n" in data header_line = data[0].replace("\n", "") if data else "" - is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line + header_cols = set(header_line.split("\t")) + unified_required = {"Condition", "BioReplicate", "Filename", "Fraction", "Sample"} + is_unified_format = unified_required.issubset(header_cols) if is_unified_format and not has_blank_line: return _parse_unified_design(exp_design_file)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@quantmsutils/diann/diann2msstats.py` around lines 119 - 128, The unified-format detection logic (is_unified_format) only substring-checks header_line for "Condition", "BioReplicate", and "Filename" and misses required columns ("Fraction" and "Sample"), which can cause obscure KeyError in _parse_unified_design; fix by splitting header_line into exact column names (e.g., cols = header_line.split(...) or csv parsing) and verify that the full required set {"Condition","BioReplicate","Filename","Fraction","Sample"} is a subset of cols before calling _parse_unified_design; if any required column is missing, raise a clear exception or fall back to _parse_legacy_design with a descriptive error message so callers know which columns are absent.
145-149:drop_duplicatessilently ignores inconsistent sample metadata.If the same
Sampleappears with differentConditionorBioReplicatevalues across rows (e.g., due to a typo in the input file),drop_duplicates(subset=["Sample"])keeps only the first row, silently discarding conflicting data. Consider validating that each sample has consistent metadata before de-duplication.🛡️ Optional validation before de-duplication
# Build s_data_frame (sample table): unique samples with condition/bioreplicate + sample_meta = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates() + if sample_meta["Sample"].duplicated().any(): + inconsistent = sample_meta[sample_meta["Sample"].duplicated(keep=False)] + logger.warning( + "Sample(s) have inconsistent Condition/BioReplicate values: %s. " + "Using first occurrence.", + inconsistent["Sample"].unique().tolist(), + ) s_data_frame = ( df[["Sample", "Condition", "BioReplicate"]] .drop_duplicates(subset=["Sample"]) .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"}) )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@quantmsutils/diann/diann2msstats.py` around lines 145 - 149, The current s_data_frame creation uses df[["Sample","Condition","BioReplicate"]].drop_duplicates(subset=["Sample"]) which silently keeps the first row when a Sample has conflicting Condition or BioReplicate values; before calling drop_duplicates, validate consistency by grouping df by "Sample" (use the existing df) and checking that each group's "Condition" and "BioReplicate" have a single unique value (nunique == 1); if any sample has >1 unique value for either field, raise a clear exception or log an error listing the offending Sample IDs so the input can be corrected, otherwise proceed to rename and drop_duplicates as currently done to produce s_data_frame.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Around line 156-162: The except block that replaces the ValueError from
data.index("\n") should suppress exception chaining; modify the raise in the
except to use "raise ValueError(... ) from None" so the new descriptive error
about exp_design_file and the missing blank separator row is raised without
attaching the original ValueError; update the code around the data.index call
and the empty_row handling (variable empty_row, function/module diann2msstats)
accordingly.
---
Nitpick comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Around line 119-128: The unified-format detection logic (is_unified_format)
only substring-checks header_line for "Condition", "BioReplicate", and
"Filename" and misses required columns ("Fraction" and "Sample"), which can
cause obscure KeyError in _parse_unified_design; fix by splitting header_line
into exact column names (e.g., cols = header_line.split(...) or csv parsing) and
verify that the full required set
{"Condition","BioReplicate","Filename","Fraction","Sample"} is a subset of cols
before calling _parse_unified_design; if any required column is missing, raise a
clear exception or fall back to _parse_legacy_design with a descriptive error
message so callers know which columns are absent.
- Around line 145-149: The current s_data_frame creation uses
df[["Sample","Condition","BioReplicate"]].drop_duplicates(subset=["Sample"])
which silently keeps the first row when a Sample has conflicting Condition or
BioReplicate values; before calling drop_duplicates, validate consistency by
grouping df by "Sample" (use the existing df) and checking that each group's
"Condition" and "BioReplicate" have a single unique value (nunique == 1); if any
sample has >1 unique value for either field, raise a clear exception or log an
error listing the offending Sample IDs so the input can be corrected, otherwise
proceed to rename and drop_duplicates as currently done to produce s_data_frame.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a2378c6a-5b5d-4709-8f9e-3b40a15eb3c1
📒 Files selected for processing (1)
quantmsutils/diann/diann2msstats.py
Summary by CodeRabbit
Improvements
Bug Fixes / Validation