feat: add input nodes as evaluation baseline by ntalluri · Pull Request #372 · Reed-CompBio/spras

ntalluri · 2025-08-14T20:23:36Z

Working on #370

read-the-docs-community · 2025-08-14T20:24:36Z

Documentation build overview

📚 spras | 🛠️ Build #29331954 | 📁 Comparing a57ef19 against latest (2857eb1)

🔍 Preview build

Show files changed (3 files in total): 📝 3 modified | ➕ 0 added | ➖ 0 deleted

File	Status
genindex.html	📝 modified
fordevs/modules.html	📝 modified
fordevs/spras.html	📝 modified

ntalluri · 2025-08-14T20:26:30Z

Current visualization on the egfr dataset

agitter · 2025-08-15T23:02:08Z

I thought this pull request was for something else. Could we rename it to be more like "feat: add input nodes as evaluation baseline"?

ntalluri · 2025-08-18T20:49:27Z

spras/evaluation.py

+                # TODO what if the node_ensemble is all frequency = 0.0, that will be the new source/target/prize/ baseline?
+
+                # Set frequency to 1.0 for matching nodes
+                prize_node_ensemble_df.loc[


Make the baseline one baseline per dataset, not per algorithm. All the sources/prizes/targets/actives are set to 1.0 and everything else is set to 0.0, but we still calculate the precision and recall.

This will be algorithm-agnostic.

Another option that may also be needed is to evaluate only the internal nodes (set the sources/targets/prizes to 0.0).

We might need both of these. One will be a baseline on the current ensemble PR curve. The other would mean new figures for all of the evaluation, covering only the internal nodes.
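As a rough illustration of the two options in this comment, the sketch below builds a per-dataset baseline ensemble (input nodes at frequency 1.0, everything else at 0.0) and an internal-nodes-only view of an algorithm's ensemble (input nodes zeroed out). The function names, node IDs, and plain-dict representation are assumptions made for illustration; the code in spras/evaluation.py appears to operate on pandas DataFrames instead.

```python
def input_nodes_baseline(all_nodes, input_nodes):
    """Algorithm-agnostic baseline: frequency 1.0 for input nodes, 0.0 otherwise."""
    return {node: 1.0 if node in input_nodes else 0.0 for node in all_nodes}


def internal_nodes_only(node_ensemble, input_nodes):
    """Zero out input nodes so that only internal nodes contribute to the evaluation."""
    return {
        node: 0.0 if node in input_nodes else freq
        for node, freq in node_ensemble.items()
    }


# Hypothetical toy data: node IDs and frequencies are made up.
all_nodes = ["A", "B", "C", "D"]
input_nodes = {"A", "D"}  # sources, targets, prizes, actives
algorithm_ensemble = {"A": 1.0, "B": 0.5, "C": 0.0, "D": 1.0}

baseline = input_nodes_baseline(all_nodes, input_nodes)
internal = internal_nodes_only(algorithm_ensemble, input_nodes)
print(baseline)
print(internal)
```

The baseline depends only on the dataset's inputs, so it can be computed once per dataset and reused across algorithms.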

test/evaluate/test_evaluate.py

Snakefile

spras/evaluation.py

agitter

This is getting close.

Snakefile

agitter · 2025-08-22T22:23:33Z

spras/evaluation.py

+                # the Input_Nodes_Baseline PR curve highlights their overlap with the gold standard.
+                if prc_input_nodes_baseline_df is None:
+                    input_nodes_set = set(input_nodes['NODEID'])
+                    input_nodes_gold_intersection = input_nodes_set & gold_standard_nodes # TODO should this be all inputs nodes or the intersection with the gold standard for this baseline? I think it should be the intersection


Very good question. For a synthetic dataset like Panther pathways, the inputs are sampled from the pathway nodes that make up the gold standard so it doesn't matter.

For an omics input like EGFR, it matters substantially. What makes you prefer the intersection? I was inclined to say all input nodes because we cannot have a baseline algorithm that makes use of gold standard information as part of its ranking. I could create a valid pathway reconstruction algorithm that takes the input nodes and simply returns those. I can't use a gold standard in a valid pathway reconstruction algorithm, however.

In my opinion, the only input nodes that matter for evaluation are those that overlap with the gold standard. While it’s true that an algorithm could trivially return all input nodes, our precision recall evaluation is only defined with respect to the gold standard. Input nodes that aren’t in the gold standard don’t contribute to true positives, so including them in the baseline wouldn’t be meaningful.

That said, I also see the case for using all input nodes as it represents a valid baseline algorithm where an algorithm could simply return the given inputs without any reconstruction.

Maybe the difference is that the intersection provides an upper bound, while all input nodes provides a lower bound on what you could do without doing any reconstruction and we should provide both?

Think through three baselines and explain why each of these can be used or is needed:

1. the input nodes by themselves, with no intersection
2. the intersection of the gold standard and the input nodes
3. the gold standard by itself (which I think is what the baseline is)

Or do we want to make an ensemble with the gold standard at 1.0 and everything else at 0.0 and compute a PRC?

Deciding this is the last point of feedback and last decision to finalize. Then I can do a last careful review and we should be ready to merge.

In our meeting, we discussed how options 1 and 2 will give the same recall. The only difference is that one will have some precision value and the other always has precision of 1.0.
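A small worked example may make the meeting conclusion concrete. It is a sketch with made-up node sets and a hypothetical `precision_recall` helper, not code from the PR. It compares option 1 (all input nodes) with option 2 (input nodes intersected with the gold standard).

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted node set against a gold standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical node sets: two of the four inputs are in the gold standard.
gold_standard_nodes = {"A", "B", "C", "D"}
input_nodes_set = {"A", "B", "X", "Y"}

all_inputs = precision_recall(input_nodes_set, gold_standard_nodes)
intersection = precision_recall(
    input_nodes_set & gold_standard_nodes, gold_standard_nodes
)
print("all input nodes:", all_inputs)
print("intersection:   ", intersection)
```

Both options have the same true positives, so recall is identical. Intersecting only removes false positives, which pushes precision to 1.0 whenever the intersection is non-empty.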

spras/evaluation.py

test/evaluate/test_evaluate.py

Snakefile

agitter · 2025-10-03T22:21:59Z

@ntalluri is this waiting for my review or your updates? We haven't touched it in a while.

tristan-f-r · 2025-08-28T19:08:32Z

spras/evaluation.py

            raise ValueError(
-                f"Cannot compute PR curve or generate node ensemble. Input network for dataset \"{dataset_file.split('-')[0]}\" is empty."
+                f"Cannot compute PR curve or generate node ensemble. The input network is empty."
            )
        if node_table.empty:
            raise ValueError(
-                f"Cannot compute PR curve or generate node ensemble. Gold standard associated with dataset \"{dataset_file.split('-')[0]}\" is empty."
+                f"Cannot compute PR curve or generate node ensemble. The gold standard is empty."
            )


We should still supply the dataset name in this function to preserve the error information.
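One way to act on this suggestion, sketched under assumptions: pass the dataset name in explicitly rather than deriving it from a file name. The function name `check_evaluation_inputs` and the `dataset_name` parameter are hypothetical and not taken from the PR; only the message wording follows the removed lines in the diff.

```python
def check_evaluation_inputs(node_ensemble_rows, gold_standard_rows, dataset_name):
    """Raise a descriptive error if either input is empty (hypothetical helper)."""
    if len(node_ensemble_rows) == 0:
        raise ValueError(
            "Cannot compute PR curve or generate node ensemble. "
            f'Input network for dataset "{dataset_name}" is empty.'
        )
    if len(gold_standard_rows) == 0:
        raise ValueError(
            "Cannot compute PR curve or generate node ensemble. "
            f'Gold standard associated with dataset "{dataset_name}" is empty.'
        )


# Usage: the dataset name survives in the error message.
try:
    check_evaluation_inputs([], ["A", "B"], dataset_name="egfr")
except ValueError as err:
    message = str(err)
    print(message)
```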

tristan-f-r · 2025-08-28T19:12:42Z

spras/evaluation.py

+                    # Dropping last elements because scikit-learn adds (1, 0) to precision/recall for plotting, not tied to real thresholds
+                    prc_input_nodes_baseline_data = {


(The comment should be moved up)

Suggested change

- # Dropping last elements because scikit-learn adds (1, 0) to precision/recall for plotting, not tied to real thresholds
- prc_input_nodes_baseline_data = {
+ # Dropping last elements because scikit-learn adds (1, 0) to precision/recall for plotting, not tied to real thresholds
+ # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve:~:text=Returns%3A-,precision,predictions%20with%20score%20%3E%3D%20thresholds%5Bi%5D%20and%20the%20last%20element%20is%200.,-thresholds
+ prc_input_nodes_baseline_data = {
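For readers unfamiliar with the behavior the comment refers to, here is a minimal sketch using `sklearn.metrics.precision_recall_curve` on toy labels and scores. The arrays and the dictionary keys are made up for illustration and are not the ones used in spras/evaluation.py.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy data: 1 marks a gold standard node, scores play the role of ensemble frequencies.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision and recall have one more element than thresholds: the final
# (precision=1, recall=0) pair is appended by scikit-learn and has no threshold.
print(len(precision), len(recall), len(thresholds))
print(precision[-1], recall[-1])

# Dropping the last element aligns each precision/recall value with a real threshold.
prc_data = {
    "Threshold": thresholds,
    "Precision": precision[:-1],
    "Recall": recall[:-1],
}
```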

tristan-f-r · 2025-08-28T22:48:06Z

spras/dataset.py

+    def get_interesting_input_nodes(self) -> pd.DataFrame:
+        """
+        Returns: a table listing the input nodes considered as starting points for pathway reconstruction algorithms,
+        restricted to nodes that have at least one of the specified attributes.


Suggested change

- restricted to nodes that have at least one of the specified attributes.
+ or all of the nodes in this dataset that have at least one 'interesting' attribute as specified in-code.
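To make the docstring's "at least one attribute" wording concrete, here is a hypothetical sketch of such a filter on a pandas node table. It is not the PR's implementation of `get_interesting_input_nodes`. The standalone function, the column names, and the rule that any non-missing value counts as having the attribute are all assumptions.

```python
import pandas as pd


def interesting_input_nodes(node_table, attributes):
    """Keep rows with a non-missing value in at least one attribute column (assumed rule)."""
    present = [col for col in attributes if col in node_table.columns]
    if not present:
        # None of the attribute columns exist, so no node qualifies.
        return node_table.iloc[0:0]
    mask = node_table[present].notna().any(axis=1)
    return node_table.loc[mask].reset_index(drop=True)


# Hypothetical node table: B has no attribute set and is dropped.
node_table = pd.DataFrame(
    {
        "NODEID": ["A", "B", "C"],
        "prize": [1.5, None, None],
        "sources": [None, None, True],
    }
)
result = interesting_input_nodes(
    node_table, ["prize", "sources", "targets", "active"]
)
print(result)
```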

tristan-f-r · 2025-08-28T22:48:50Z

spras/evaluation.py


        @param node_ensembles: dict of the pre-computed node_ensemble(s)
        @param node_table: gold standard nodes
+        @param input_nodes: the input nodes (sources, targets, prizes, actives) used for a specific dataset


Suggested change

- @param input_nodes: the input nodes (sources, targets, prizes, actives) used for a specific dataset
+ @param input_nodes: the input nodes (usually from `Dataset#get_interesting_input_nodes`) used for a specific dataset

ntalluri · 2025-10-14T18:00:54Z

> @ntalluri is this waiting for my review or your updates? We haven't touched it in a while.

@agitter I had a couple of things to think about and fix for this PR; it hasn't been my main priority to finish yet.

precommit

968bd09

ntalluri changed the title ~~feat: adding new ensemble parameter tuning baseline~~ feat: add new ensemble parameter tuning baseline Aug 14, 2025

ntalluri closed this Aug 14, 2025

ntalluri reopened this Aug 14, 2025

tristan-f-r added the tuning (Workflow-spanning algorithm tuning) label Aug 15, 2025

ntalluri commented Aug 18, 2025

ntalluri changed the title ~~feat: add new ensemble parameter tuning baseline~~ feat: add input nodes as evaluation baseline Aug 19, 2025

ntalluri added the analysis (Analysis of PRA outputs) and needed for benchmarking (Priority PRs needed for the benchmarking paper) labels and removed the tuning (Workflow-spanning algorithm tuning) label Aug 19, 2025

ntalluri added 4 commits August 19, 2025 17:34

updating the baseline

7202a99

make it all input nodes, not the intersection between gold standard and input nodes

1b3be75

update to be the baseline with gold standard intersection, added data to the txt files

8ca0e9e

debugging error in test cases

dda6264

ntalluri commented Aug 21, 2025

test/evaluate/test_evaluate.py (outdated, resolved)

tristan-f-r mentioned this pull request Aug 21, 2025

fix: never parse index_col from node data #379

Open

1 task

fix input-nodes file with tristan's update

d8a16fc

ntalluri commented Aug 21, 2025

Snakefile (outdated, resolved)

ntalluri added 4 commits August 21, 2025 15:41

update test case

7f3c7f6

clean up code and outputs

8492f70

update to use input nodes and interactome from the snakefile instead of the functions itself

9b8e31c

update test cases

04ddc1f

ntalluri commented Aug 22, 2025

spras/evaluation.py (outdated, resolved)

removed unnecessary code

e98d1e2

agitter reviewed Aug 22, 2025

ntalluri added 2 commits August 25, 2025 10:12

Merge branch 'main' of github.com:ntalluri/spras into new-baseline

5e10e95

update code to use helper function for input nodes

2004c9a

ntalluri added 2 commits August 25, 2025 11:04

update test case to use new helper function

82fbfce

update comment

dcab781

ntalluri commented Aug 26, 2025

Snakefile (outdated, resolved)

ntalluri commented Aug 26, 2025

Snakefile (outdated, resolved)

fix space formatting

a57ef19

ntalluri requested a review from agitter August 28, 2025 15:37

github-actions bot added the merge-conflict (This PR has merge conflicts.) label Oct 3, 2025

tristan-f-r reviewed Oct 14, 2025

tristan-f-r added the awaiting-author (Author of the PR needs to fix something from a review / etc.) label Oct 14, 2025
