Changes DIANN convert to MSStats by ypriverol · Pull Request #67 · bigbio/quantms-utils

ypriverol · 2026-03-21T06:14:51Z

Summary by CodeRabbit

Improvements
- Parser now auto-detects and supports a unified experimental-design tabular format with standardized columns, while continuing to accept the legacy two-table layout.
- Seamless handling of both formats preserves backward compatibility with existing workflows.
Bug Fixes / Validation
- Clearer validation: reports an explicit error if a sample maps to conflicting condition/bioreplicate assignments; legacy format still enforces its expected separator.

coderabbitai · 2026-03-21T06:15:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c4af229-7e0c-4c79-8228-86d0bbd16851

📥 Commits

Reviewing files that changed from the base of the PR and between 2f3118e and 49a5565.

📒 Files selected for processing (1)

quantmsutils/diann/diann2msstats.py

🚧 Files skipped from review as they are similar to previous changes (1)

quantmsutils/diann/diann2msstats.py

📝 Walkthrough

Walkthrough

get_exp_design_dfs() in quantmsutils/diann/diann2msstats.py now auto-detects a unified experimental design by inspecting the header for Condition, BioReplicate, and Filename. If detected it parses the single-table unified format; otherwise it falls back to the legacy blank-line-separated two-table parsing. Legacy logic was extracted to _parse_legacy_design() and unified logic implemented in _parse_unified_design().

Changes

Cohort / File(s)	Summary
Design parsing & helpers `quantmsutils/diann/diann2msstats.py`	Added header-based auto-detection in `get_exp_design_dfs()` to choose between unified and legacy formats. Introduced `_parse_unified_design()` to read a single tab-separated table, validate required columns (`Filename`, `Fraction`, `Sample`, `Condition`, `BioReplicate`), derive `run` from `Filename` stem, build `f_table` and deduplicated `s_data_frame` (renaming `Condition→MSstats_Condition`, `BioReplicate→MSstats_BioReplicate`), and error if a `Sample` maps to multiple `(Condition, BioReplicate)` pairs. Extracted prior blank-line-separated parsing to `_parse_legacy_design()` preserving previous behavior and erroring when the separator row is missing. No public function signatures changed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble headers, sniff the Filename line,
Unified or legacy — I choose just fine.
I stitch runs from stems and tidy each sample,
Hop, parse, and dedupe — then give a small scrample.
Tiny paws, big fixes, the design files shine.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Changes DIANN convert to MSStats' is vague and generic, lacking specific detail about what changes were made.	Revise the title to be more specific about the primary change, such as 'Add unified experimental design format support to DIANN to MSStats converter' or 'Support auto-detection of unified experimental design format in DIANN2MSStats'.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

CodeRabbit can use OpenGrep to find security vulnerabilities and bugs across 17+ programming languages.

OpenGrep is compatible with Semgrep configurations. Add an opengrep.yml or semgrep.yml configuration file to your project to enable OpenGrep analysis.

Copilot

Pull request overview

This PR updates the DIA-NN → MSstats converter to accept a newer “unified” (single-table) experimental design TSV in addition to the existing legacy two-table format (blank-line separated), by auto-detecting the format and routing to the appropriate parser.

Changes:

Add format auto-detection in get_exp_design_dfs() to choose between unified vs legacy parsing.
Introduce _parse_unified_design() to translate the unified flat table into the legacy (s_data_frame, f_table) interface.
Refactor the legacy parsing logic into _parse_legacy_design().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-21T06:17:49Z

quantmsutils/diann/diann2msstats.py

+    has_blank_line = "\n" in data
+    header_line = data[0].replace("\n", "") if data else ""
+    is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line
+
+    if is_unified_format and not has_blank_line:


Unified-format detection currently forces legacy parsing whenever any blank line exists in the file (has_blank_line = "\n" in data). This makes unified design parsing fragile (e.g., a trailing empty line will route to _parse_legacy_design and raise the legacy "blank separator row" error). Consider ignoring leading/trailing empty/whitespace-only lines for detection, or preferring unified parsing whenever the header matches unified columns regardless of extra blank lines.

Suggested change

has_blank_line = "\n" in data

header_line = data[0].replace("\n", "") if data else ""

is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line

if is_unified_format and not has_blank_line:

header_line = data[0].replace("\n", "") if data else ""

is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line

# Prefer unified parsing whenever the header matches unified columns,

# regardless of extra blank/whitespace-only lines elsewhere in the file.

if is_unified_format:

Copilot · 2026-03-21T06:17:49Z

quantmsutils/diann/diann2msstats.py

+    # Build s_data_frame (sample table): unique samples with condition/bioreplicate
+    s_data_frame = (
+        df[["Sample", "Condition", "BioReplicate"]]
+        .drop_duplicates(subset=["Sample"])
+        .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})


_parse_unified_design drops duplicates on Sample only. If the input contains the same Sample with different Condition/BioReplicate across rows (e.g., due to a design mistake), this will silently pick the first row and produce incorrect mapping. It would be safer to assert that each Sample maps to exactly one (Condition, BioReplicate) pair and raise a clear ValueError when conflicts are present.

Suggested change

# Build s_data_frame (sample table): unique samples with condition/bioreplicate

s_data_frame = (

df[["Sample", "Condition", "BioReplicate"]]

.drop_duplicates(subset=["Sample"])

.rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})

# Build s_data_frame (sample table): ensure each Sample maps to exactly one

# (Condition, BioReplicate) pair, and raise if conflicting mappings are present.

unique_mapping = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates()

# Detect samples that are associated with multiple distinct (Condition, BioReplicate) pairs.

duplicated_samples = unique_mapping["Sample"][unique_mapping["Sample"].duplicated(keep=False)].unique()

if len(duplicated_samples) > 0:

conflict_list = ", ".join(map(str, duplicated_samples))

raise ValueError(

"Inconsistent experimental design in unified format: the following Sample(s) map to "

"multiple (Condition, BioReplicate) combinations: "

f"{conflict_list}. Each Sample must have exactly one Condition and one BioReplicate."

)

s_data_frame = unique_mapping.rename(

columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"}

Copilot · 2026-03-21T06:17:50Z

quantmsutils/diann/diann2msstats.py

+    if is_unified_format and not has_blank_line:
+        return _parse_unified_design(exp_design_file)


The new unified design parsing path (_parse_unified_design) isn’t covered by tests. The existing CLI test uses the legacy two-table design file, so regressions in unified parsing (column expectations, run derivation, merging) wouldn’t be caught. Please add a test fixture for a unified flat design TSV and a corresponding test that runs diann2msstats end-to-end.

Copilot · 2026-03-21T06:17:50Z

quantmsutils/diann/diann2msstats.py

+    df = pd.read_csv(exp_design_file, sep="\t")
+    df["run"] = df["Filename"].apply(lambda x: _true_stem(x))
+
+    # Build f_table (file table): one row per file/channel
+    f_table = df[["Filename", "Fraction", "Sample", "run"]].copy()
+


_parse_unified_design assumes required columns exist (Filename, Fraction, Sample, Condition, BioReplicate) and will raise a KeyError if any are missing/misspelled. Since this function is part of CLI input handling, consider validating the required columns up front and raising a ValueError with a helpful message listing missing columns and the expected header.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

quantmsutils/diann/diann2msstats.py (2)

119-128: Format detection doesn't validate all required columns.

Line 123 checks for Condition, BioReplicate, and Filename, but _parse_unified_design() also requires Fraction and Sample columns (lines 142, 146). If a file passes the unified format check but lacks these columns, a KeyError will be raised with a less informative message.

Additionally, the substring check (e.g., "Condition" in header_line) could match unintended column names like "MyCondition". Consider splitting the header and checking exact column names.

♻️ Suggested improvement

     # Auto-detect format: new unified flat table (from convert-diann) vs legacy two-table
     has_blank_line = "\n" in data
     header_line = data[0].replace("\n", "") if data else ""
-    is_unified_format = "Condition" in header_line and "BioReplicate" in header_line and "Filename" in header_line
+    header_cols = set(header_line.split("\t"))
+    unified_required = {"Condition", "BioReplicate", "Filename", "Fraction", "Sample"}
+    is_unified_format = unified_required.issubset(header_cols)
 
     if is_unified_format and not has_blank_line:
         return _parse_unified_design(exp_design_file)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@quantmsutils/diann/diann2msstats.py` around lines 119 - 128, The
unified-format detection logic (is_unified_format) only substring-checks
header_line for "Condition", "BioReplicate", and "Filename" and misses required
columns ("Fraction" and "Sample"), which can cause obscure KeyError in
_parse_unified_design; fix by splitting header_line into exact column names
(e.g., cols = header_line.split(...) or csv parsing) and verify that the full
required set {"Condition","BioReplicate","Filename","Fraction","Sample"} is a
subset of cols before calling _parse_unified_design; if any required column is
missing, raise a clear exception or fall back to _parse_legacy_design with a
descriptive error message so callers know which columns are absent.

145-149: drop_duplicates silently ignores inconsistent sample metadata.

If the same Sample appears with different Condition or BioReplicate values across rows (e.g., due to a typo in the input file), drop_duplicates(subset=["Sample"]) keeps only the first row, silently discarding conflicting data. Consider validating that each sample has consistent metadata before de-duplication.

🛡️ Optional validation before de-duplication

     # Build s_data_frame (sample table): unique samples with condition/bioreplicate
+    sample_meta = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates()
+    if sample_meta["Sample"].duplicated().any():
+        inconsistent = sample_meta[sample_meta["Sample"].duplicated(keep=False)]
+        logger.warning(
+            "Sample(s) have inconsistent Condition/BioReplicate values: %s. "
+            "Using first occurrence.",
+            inconsistent["Sample"].unique().tolist(),
+        )
     s_data_frame = (
         df[["Sample", "Condition", "BioReplicate"]]
         .drop_duplicates(subset=["Sample"])
         .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})
     )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@quantmsutils/diann/diann2msstats.py` around lines 145 - 149, The current
s_data_frame creation uses
df[["Sample","Condition","BioReplicate"]].drop_duplicates(subset=["Sample"])
which silently keeps the first row when a Sample has conflicting Condition or
BioReplicate values; before calling drop_duplicates, validate consistency by
grouping df by "Sample" (use the existing df) and checking that each group's
"Condition" and "BioReplicate" have a single unique value (nunique == 1); if any
sample has >1 unique value for either field, raise a clear exception or log an
error listing the offending Sample IDs so the input can be corrected, otherwise
proceed to rename and drop_duplicates as currently done to produce s_data_frame.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Around line 156-162: The except block that replaces the ValueError from
data.index("\n") should suppress exception chaining; modify the raise in the
except to use "raise ValueError(... ) from None" so the new descriptive error
about exp_design_file and the missing blank separator row is raised without
attaching the original ValueError; update the code around the data.index call
and the empty_row handling (variable empty_row, function/module diann2msstats)
accordingly.

---

Nitpick comments:
In `@quantmsutils/diann/diann2msstats.py`:
- Around line 119-128: The unified-format detection logic (is_unified_format)
only substring-checks header_line for "Condition", "BioReplicate", and
"Filename" and misses required columns ("Fraction" and "Sample"), which can
cause obscure KeyError in _parse_unified_design; fix by splitting header_line
into exact column names (e.g., cols = header_line.split(...) or csv parsing) and
verify that the full required set
{"Condition","BioReplicate","Filename","Fraction","Sample"} is a subset of cols
before calling _parse_unified_design; if any required column is missing, raise a
clear exception or fall back to _parse_legacy_design with a descriptive error
message so callers know which columns are absent.
- Around line 145-149: The current s_data_frame creation uses
df[["Sample","Condition","BioReplicate"]].drop_duplicates(subset=["Sample"])
which silently keeps the first row when a Sample has conflicting Condition or
BioReplicate values; before calling drop_duplicates, validate consistency by
grouping df by "Sample" (use the existing df) and checking that each group's
"Condition" and "BioReplicate" have a single unique value (nunique == 1); if any
sample has >1 unique value for either field, raise a clear exception or log an
error listing the offending Sample IDs so the input can be corrected, otherwise
proceed to rename and drop_duplicates as currently done to produce s_data_frame.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a2378c6a-5b5d-4709-8f9e-3b40a15eb3c1

📥 Commits

Reviewing files that changed from the base of the PR and between 43ccb4a and 2f3118e.

📒 Files selected for processing (1)

quantmsutils/diann/diann2msstats.py

quantmsutils/diann/diann2msstats.py

ypriverol added 2 commits March 20, 2026 07:34

converter to msstats from DIANN

3c948d4

Merge branch 'main' of https://github.com/bigbio/quantms-utils into dev

2f3118e

ypriverol requested a review from Copilot March 21, 2026 06:15

Copilot started reviewing on behalf of ypriverol March 21, 2026 06:15 View session

Copilot AI reviewed Mar 21, 2026

View reviewed changes

coderabbitai bot reviewed Mar 21, 2026

View reviewed changes

quantmsutils/diann/diann2msstats.py Outdated Show resolved Hide resolved

minor changes

49a5565

ypriverol merged commit 8632dfb into main Mar 21, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes DIANN convert to MSStats #67

Changes DIANN convert to MSStats #67
ypriverol merged 3 commits intomainfrom
dev

ypriverol commented Mar 21, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 21, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    # Build s_data_frame (sample table): unique samples with condition/bioreplicate
-    s_data_frame = (
-        df[["Sample", "Condition", "BioReplicate"]]
-        .drop_duplicates(subset=["Sample"])
-        .rename(columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"})
+    # Build s_data_frame (sample table): ensure each Sample maps to exactly one
+    # (Condition, BioReplicate) pair, and raise if conflicting mappings are present.
+    unique_mapping = df[["Sample", "Condition", "BioReplicate"]].drop_duplicates()
+    # Detect samples that are associated with multiple distinct (Condition, BioReplicate) pairs.
+    duplicated_samples = unique_mapping["Sample"][unique_mapping["Sample"].duplicated(keep=False)].unique()
+    if len(duplicated_samples) > 0:
+        conflict_list = ", ".join(map(str, duplicated_samples))
+        raise ValueError(
+            "Inconsistent experimental design in unified format: the following Sample(s) map to "
+            "multiple (Condition, BioReplicate) combinations: "
+            f"{conflict_list}. Each Sample must have exactly one Condition and one BioReplicate."
+        )
+    s_data_frame = unique_mapping.rename(
+        columns={"Condition": "MSstats_Condition", "BioReplicate": "MSstats_BioReplicate"}

		if is_unified_format and not has_blank_line:
		return _parse_unified_design(exp_design_file)

Conversation

ypriverol commented Mar 21, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ypriverol commented Mar 21, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 21, 2026 •

edited

Loading