Skip to content

Changes DIANN convert to MSStats #67

Merged
ypriverol merged 3 commits intomainfrom
dev
Mar 21, 2026
Merged

Changes DIANN convert to MSStats #67
ypriverol merged 3 commits intomainfrom
dev

Conversation

@ypriverol
Copy link
Member

@ypriverol ypriverol commented Mar 21, 2026

Summary by CodeRabbit

  • Improvements

    • Parser now auto-detects and supports a unified experimental-design tabular format with standardized columns, while continuing to accept the legacy two-table layout.
    • Seamless handling of both formats preserves backward compatibility with existing workflows.
  • Bug Fixes / Validation

    • Clearer validation: reports an explicit error if a sample maps to conflicting condition/bioreplicate assignments; legacy format still enforces its expected separator.

@ypriverol ypriverol requested a review from Copilot March 21, 2026 06:15
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c4af229-7e0c-4c79-8228-86d0bbd16851

📥 Commits

Reviewing files that changed from the base of the PR and between 2f3118e and 49a5565.

📒 Files selected for processing (1)
  • quantmsutils/diann/diann2msstats.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • quantmsutils/diann/diann2msstats.py

📝 Walkthrough

Walkthrough

get_exp_design_dfs() in quantmsutils/diann/diann2msstats.py now auto-detects a unified experimental design by inspecting the header for Condition, BioReplicate, and Filename. If detected it parses the single-table unified format; otherwise it falls back to the legacy blank-line-separated two-table parsing. Legacy logic was extracted to _parse_legacy_design() and unified logic implemented in _parse_unified_design().

Changes

Cohort / File(s) Summary
Design parsing & helpers
quantmsutils/diann/diann2msstats.py
Added header-based auto-detection in get_exp_design_dfs() to choose between unified and legacy formats. Introduced _parse_unified_design() to read a single tab-separated table, validate required columns (Filename, Fraction, Sample, Condition, BioReplicate), derive run from Filename stem, build f_table and deduplicated s_data_frame (renaming Condition→MSstats_Condition, BioReplicate→MSstats_BioReplicate), and error if a Sample maps to multiple (Condition, BioReplicate) pairs. Extracted prior blank-line-separated parsing to _parse_legacy_design() preserving previous behavior and erroring when the separator row is missing. No public function signatures changed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble headers, sniff the Filename line,
Unified or legacy — I choose just fine.
I stitch runs from stems and tidy each sample,
Hop, parse, and dedupe — then give a small scrample.
Tiny paws, big fixes, the design files shine.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Changes DIANN convert to MSStats' is vague and generic, lacking specific detail about what changes were made. Revise the title to be more specific about the primary change, such as 'Add unified experimental design format support to DIANN to MSStats converter' or 'Support auto-detection of unified experimental design format in DIANN2MSStats'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use OpenGrep to find security vulnerabilities and bugs across 17+ programming languages.

OpenGrep is compatible with Semgrep configurations. Add an opengrep.yml or semgrep.yml configuration file to your project to enable OpenGrep analysis.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the DIA-NN → MSstats converter to accept a newer “unified” (single-table) experimental design TSV in addition to the existing legacy two-table format (blank-line separated), by auto-detecting the format and routing to the appropriate parser.

Changes:

  • Add format auto-detection in get_exp_design_dfs() to choose between unified vs legacy parsing.
  • Introduce _parse_unified_design() to translate the unified flat table into the legacy (s_data_frame, f_table) interface.
  • Refactor the legacy parsing logic into _parse_legacy_design().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +121 to +125
has_blank_line = "\n" in data
header_line = data[0].replace("\n", "") if data else ""
is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line

if is_unified_format and not has_blank_line:
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unified-format detection currently forces legacy parsing whenever any blank line exists in the file (has_blank_line = "\n" in data). This makes unified design parsing fragile (e.g., a trailing empty line will route to _parse_legacy_design and raise the legacy "blank separator row" error). Consider ignoring leading/trailing empty/whitespace-only lines for detection, or preferring unified parsing whenever the header matches unified columns regardless of extra blank lines.

Suggested change
has_blank_line = "\n" in data
header_line = data[0].replace("\n", "") if data else ""
is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line
if is_unified_format and not has_blank_line:
header_line = data[0].replace("\n", "") if data else ""
is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line
# Prefer unified parsing whenever the header matches unified columns,
# regardless of extra blank/whitespace-only lines elsewhere in the file.
if is_unified_format:

Copilot uses AI. Check for mistakes.
Comment on lines +144 to +148
# Build s_data_frame (sample table): unique samples with condition/bioreplicate
s_data_frame = (
df[["Sample", "Condition", "BioReplicate"]]
.drop_duplicates(subset=["Sample"])
.rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_unified_design drops duplicates on Sample only. If the input contains the same Sample with different Condition/BioReplicate across rows (e.g., due to a design mistake), this will silently pick the first row and produce incorrect mapping. It would be safer to assert that each Sample maps to exactly one (Condition, BioReplicate) pair and raise a clear ValueError when conflicts are present.

Suggested change
# Build s_data_frame (sample table): unique samples with condition/bioreplicate
s_data_frame = (
df[["Sample", "Condition", "BioReplicate"]]
.drop_duplicates(subset=["Sample"])
.rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})
# Build s_data_frame (sample table): ensure each Sample maps to exactly one
# (Condition, BioReplicate) pair, and raise if conflicting mappings are present.
unique_mapping = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates()
# Detect samples that are associated with multiple distinct (Condition, BioReplicate) pairs.
duplicated_samples = unique_mapping["Sample"][unique_mapping["Sample"].duplicated(keep=False)].unique()
if len(duplicated_samples) > 0:
conflict_list = ", ".join(map(str, duplicated_samples))
raise ValueError(
"Inconsistent experimental design in unified format: the following Sample(s) map to "
"multiple (Condition, BioReplicate) combinations: "
f"{conflict_list}. Each Sample must have exactly one Condition and one BioReplicate."
)
s_data_frame = unique_mapping.rename(
columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"}

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +126
if is_unified_format and not has_blank_line:
return _parse_unified_design(exp_design_file)
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new unified design parsing path (_parse_unified_design) isn’t covered by tests. The existing CLI test uses the legacy two-table design file, so regressions in unified parsing (column expectations, run derivation, merging) wouldn’t be caught. Please add a test fixture for a unified flat design TSV and a corresponding test that runs diann2msstats end-to-end.

Copilot uses AI. Check for mistakes.
Comment on lines +138 to +143
df = pd.read_csv(exp_design_file, sep="\t")
df["run"] = df["Filename"].apply(lambda x: _true_stem(x))

# Build f_table (file table): one row per file/channel
f_table = df[["Filename", "Fraction", "Sample", "run"]].copy()

Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_unified_design assumes required columns exist (Filename, Fraction, Sample, Condition, BioReplicate) and will raise a KeyError if any are missing/misspelled. Since this function is part of CLI input handling, consider validating the required columns up front and raising a ValueError with a helpful message listing missing columns and the expected header.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
quantmsutils/diann/diann2msstats.py (2)

119-128: Format detection doesn't validate all required columns.

Line 123 checks for Condition, BioReplicate, and Filename, but _parse_unified_design() also requires Fraction and Sample columns (lines 142, 146). If a file passes the unified format check but lacks these columns, a KeyError will be raised with a less informative message.

Additionally, the substring check (e.g., "Condition" in header_line) could match unintended column names like "MyCondition". Consider splitting the header and checking exact column names.

♻️ Suggested improvement
     # Auto-detect format: new unified flat table (from convert-diann) vs legacy two-table
     has_blank_line = "\n" in data
     header_line = data[0].replace("\n", "") if data else ""
-    is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line
+    header_cols = set(header_line.split("\t"))
+    unified_required = {"Condition", "BioReplicate", "Filename", "Fraction", "Sample"}
+    is_unified_format = unified_required.issubset(header_cols)
 
     if is_unified_format and not has_blank_line:
         return _parse_unified_design(exp_design_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@quantmsutils/diann/diann2msstats.py` around lines 119 - 128, The
unified-format detection logic (is_unified_format) only substring-checks
header_line for "Condition", "BioReplicate", and "Filename" and misses required
columns ("Fraction" and "Sample"), which can cause obscure KeyError in
_parse_unified_design; fix by splitting header_line into exact column names
(e.g., cols = header_line.split(...) or csv parsing) and verify that the full
required set {"Condition","BioReplicate","Filename","Fraction","Sample"} is a
subset of cols before calling _parse_unified_design; if any required column is
missing, raise a clear exception or fall back to _parse_legacy_design with a
descriptive error message so callers know which columns are absent.

145-149: drop_duplicates silently ignores inconsistent sample metadata.

If the same Sample appears with different Condition or BioReplicate values across rows (e.g., due to a typo in the input file), drop_duplicates(subset=["Sample"]) keeps only the first row, silently discarding conflicting data. Consider validating that each sample has consistent metadata before de-duplication.

🛡️ Optional validation before de-duplication
     # Build s_data_frame (sample table): unique samples with condition/bioreplicate
+    sample_meta = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates()
+    if sample_meta["Sample"].duplicated().any():
+        inconsistent = sample_meta[sample_meta["Sample"].duplicated(keep=False)]
+        logger.warning(
+            "Sample(s) have inconsistent Condition/BioReplicate values: %s. "
+            "Using first occurrence.",
+            inconsistent["Sample"].unique().tolist(),
+        )
     s_data_frame = (
         df[["Sample", "Condition", "BioReplicate"]]
         .drop_duplicates(subset=["Sample"])
         .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@quantmsutils/diann/diann2msstats.py` around lines 145 - 149, The current
s_data_frame creation uses
df[["Sample","Condition","BioReplicate"]].drop_duplicates(subset=["Sample"])
which silently keeps the first row when a Sample has conflicting Condition or
BioReplicate values; before calling drop_duplicates, validate consistency by
grouping df by "Sample" (use the existing df) and checking that each group's
"Condition" and "BioReplicate" have a single unique value (nunique == 1); if any
sample has >1 unique value for either field, raise a clear exception or log an
error listing the offending Sample IDs so the input can be corrected, otherwise
proceed to rename and drop_duplicates as currently done to produce s_data_frame.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Around line 156-162: The except block that replaces the ValueError from
data.index("\n") should suppress exception chaining; modify the raise in the
except to use "raise ValueError(... ) from None" so the new descriptive error
about exp_design_file and the missing blank separator row is raised without
attaching the original ValueError; update the code around the data.index call
and the empty_row handling (variable empty_row, function/module diann2msstats)
accordingly.

---

Nitpick comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Around line 119-128: The unified-format detection logic (is_unified_format)
only substring-checks header_line for "Condition", "BioReplicate", and
"Filename" and misses required columns ("Fraction" and "Sample"), which can
cause obscure KeyError in _parse_unified_design; fix by splitting header_line
into exact column names (e.g., cols = header_line.split(...) or csv parsing) and
verify that the full required set
{"Condition","BioReplicate","Filename","Fraction","Sample"} is a subset of cols
before calling _parse_unified_design; if any required column is missing, raise a
clear exception or fall back to _parse_legacy_design with a descriptive error
message so callers know which columns are absent.
- Around line 145-149: The current s_data_frame creation uses
df[["Sample","Condition","BioReplicate"]].drop_duplicates(subset=["Sample"])
which silently keeps the first row when a Sample has conflicting Condition or
BioReplicate values; before calling drop_duplicates, validate consistency by
grouping df by "Sample" (use the existing df) and checking that each group's
"Condition" and "BioReplicate" have a single unique value (nunique == 1); if any
sample has >1 unique value for either field, raise a clear exception or log an
error listing the offending Sample IDs so the input can be corrected, otherwise
proceed to rename and drop_duplicates as currently done to produce s_data_frame.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a2378c6a-5b5d-4709-8f9e-3b40a15eb3c1

📥 Commits

Reviewing files that changed from the base of the PR and between 43ccb4a and 2f3118e.

📒 Files selected for processing (1)
  • quantmsutils/diann/diann2msstats.py

@ypriverol ypriverol merged commit 8632dfb into main Mar 21, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants