feat: add input nodes as evaluation baseline #372
ntalluri wants to merge 16 commits into Reed-CompBio:main
Documentation build overview
Show files changed (3 files in total): 📝 3 modified | ➕ 0 added | ➖ 0 deleted
I thought this pull request was for something else. Could we rename it to be more like "feat: add input nodes as evaluation baseline"?
spras/evaluation.py
Outdated
# TODO what if the node_ensemble is all frequency = 0.0, that will be the new source/target/prize/ baseline?

# Set frequency to 1.0 for matching nodes
prize_node_ensemble_df.loc[
Make the baseline one per dataset, not one per algorithm. All the sources/prizes/targets/actives get 1.0 and everything else gets 0.0, but still calculate the precision and recall.
This will be algorithm-agnostic.
Another option that may also be needed is to evaluate only the internal nodes (set the sources/targets/prizes to 0.0).
We might need both of these. One will be a baseline on the current ensemble PR. The other will be new figures for all of the evaluation that evaluate only the internal nodes.
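To make the two options above concrete, here is a minimal sketch in plain Python. It is not the code in `spras/evaluation.py`: the helper names `input_nodes_baseline`, `mask_input_nodes`, and `precision_recall`, and the toy node names, are hypothetical and chosen only for illustration.

```python
def input_nodes_baseline(all_nodes, input_nodes):
    """Hypothetical helper: per-dataset baseline ensemble.

    Input nodes (sources/targets/prizes/actives) get frequency 1.0,
    every other node gets 0.0. No algorithm output is involved.
    """
    return {node: 1.0 if node in input_nodes else 0.0 for node in all_nodes}


def mask_input_nodes(ensemble, input_nodes):
    """Hypothetical helper: keep only internal nodes by zeroing the inputs."""
    return {node: 0.0 if node in input_nodes else freq for node, freq in ensemble.items()}


def precision_recall(ensemble, gold_standard, threshold=1.0):
    """Precision and recall of the nodes with frequency >= threshold."""
    predicted = {node for node, freq in ensemble.items() if freq >= threshold}
    true_positives = len(predicted & gold_standard)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold_standard) if gold_standard else 0.0
    return precision, recall


# Toy data (assumed, not from any SPRAS dataset)
all_nodes = {"A", "B", "C", "D", "E"}
input_nodes = {"A", "B"}
gold_standard = {"A", "C", "D"}

baseline = input_nodes_baseline(all_nodes, input_nodes)
precision, recall = precision_recall(baseline, gold_standard)

# An algorithm's node ensemble with the input nodes masked out
algorithm_ensemble = {"A": 1.0, "B": 0.5, "C": 0.75, "D": 0.0, "E": 0.25}
masked = mask_input_nodes(algorithm_ensemble, input_nodes)
```

Because the baseline depends only on the dataset's input nodes, it would be computed once per dataset and drawn alongside every algorithm's curve.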
# the Input_Nodes_Baseline PR curve highlights their overlap with the gold standard.
if prc_input_nodes_baseline_df is None:
    input_nodes_set = set(input_nodes['NODEID'])
    input_nodes_gold_intersection = input_nodes_set & gold_standard_nodes # TODO should this be all inputs nodes or the intersection with the gold standard for this baseline? I think it should be the intersection
Very good question. For a synthetic dataset like Panther pathways, the inputs are sampled from the pathway nodes that make up the gold standard so it doesn't matter.
For an omics input like EGFR, it matters substantially. What makes you prefer the intersection? I was inclined to say all input nodes because we cannot have a baseline algorithm that makes use of gold standard information as part of its ranking. I could create a valid pathway reconstruction algorithm that takes the input nodes and simply returns those. I can't use a gold standard in a valid pathway reconstruction algorithm, however.
In my opinion, the only input nodes that matter for evaluation are those that overlap with the gold standard. While it’s true that an algorithm could trivially return all input nodes, our precision recall evaluation is only defined with respect to the gold standard. Input nodes that aren’t in the gold standard don’t contribute to true positives, so including them in the baseline wouldn’t be meaningful.
That said, I also see the case for using all input nodes as it represents a valid baseline algorithm where an algorithm could simply return the given inputs without any reconstruction.
Maybe the difference is that the intersection provides an upper bound, while all input nodes provide a lower bound on what you could do without doing any reconstruction, and we should provide both?
Think through 3 baselines and explain why each of these can be used or is needed:
- the input nodes by themselves, with no intersection
- the intersection of the gold standard and the input nodes
- the gold standard by itself (which I think is what baseline is)
- or do we want to make an ensemble with the gold standard as 1.0 and everything else as 0.0 and do a PRC?
Deciding this is the last point of feedback and last decision to finalize. Then I can do a last careful review and we should be ready to merge.
In our meeting, we discussed how options 1 and 2 will give the same recall. The only difference is that option 1 (all input nodes) will have some precision value, while option 2 (the intersection with the gold standard) always has a precision of 1.0.
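The recall and precision claim can be checked on a small example. This is a sketch with made-up node sets, not SPRAS code; `precision_recall` is a hypothetical helper that scores a predicted node set against a gold standard.

```python
def precision_recall(predicted, gold_standard):
    """Precision and recall of a predicted node set against the gold standard."""
    true_positives = len(predicted & gold_standard)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold_standard) if gold_standard else 0.0
    return precision, recall


# Toy data (assumed): two of the four input nodes are in the gold standard
input_nodes = {"A", "B", "C", "X"}
gold_standard = {"A", "B", "D", "E"}

# Option 1: return all input nodes
precision_1, recall_1 = precision_recall(input_nodes, gold_standard)

# Option 2: return only the input nodes that are in the gold standard
precision_2, recall_2 = precision_recall(input_nodes & gold_standard, gold_standard)
```

Both options have the same true positives, so recall matches. Option 2 discards every false positive by construction, which is why its precision is 1.0 whenever the intersection is non-empty, and also why it cannot be produced by an algorithm that never sees the gold standard.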
@ntalluri is this waiting for my review or your updates? We haven't touched it in a while.
     raise ValueError(
-        f"Cannot compute PR curve or generate node ensemble. Input network for dataset \"{dataset_file.split('-')[0]}\" is empty."
+        f"Cannot compute PR curve or generate node ensemble. The input network is empty."
     )
 if node_table.empty:
     raise ValueError(
-        f"Cannot compute PR curve or generate node ensemble. Gold standard associated with dataset \"{dataset_file.split('-')[0]}\" is empty."
+        f"Cannot compute PR curve or generate node ensemble. The gold standard is empty."
     )
We should still supply the dataset name in this function to preserve the error information.
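One way to keep the dataset name in the error, sketched under assumptions: the helper `require_non_empty`, its `dataset_name` parameter, and the name `"egfr"` are hypothetical and are not part of the PR. The idea is only that the caller passes the name in explicitly instead of the function parsing it out of a file name.

```python
def require_non_empty(table, description, dataset_name):
    """Hypothetical helper: raise an error that names the dataset if table is empty.

    `table` is anything that supports len(), for example a list or a DataFrame.
    """
    if len(table) == 0:
        raise ValueError(
            "Cannot compute PR curve or generate node ensemble. "
            f"{description} for dataset '{dataset_name}' is empty."
        )


# Usage: an empty gold standard for a dataset called "egfr" (assumed name)
try:
    require_non_empty([], "The gold standard", "egfr")
except ValueError as err:
    message = str(err)
```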
# Dropping last elements because scikit-learn adds (1, 0) to precision/recall for plotting, not tied to real thresholds
prc_input_nodes_baseline_data = {
(The comment should be moved up)
- # Dropping last elements because scikit-learn adds (1, 0) to precision/recall for plotting, not tied to real thresholds
- prc_input_nodes_baseline_data = {
+ # Dropping last elements because scikit-learn adds (1, 0) to precision/recall for plotting, not tied to real thresholds
+ # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve:~:text=Returns%3A-,precision,predictions%20with%20score%20%3E%3D%20thresholds%5Bi%5D%20and%20the%20last%20element%20is%200.,-thresholds
+ prc_input_nodes_baseline_data = {
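For readers unfamiliar with the behavior this comment documents, a small sketch with toy labels and scores (assumed values, and the dictionary keys are illustrative, not the PR's column names): `precision_recall_curve` returns precision and recall arrays that are one element longer than the thresholds array, with a final precision of 1 and recall of 0 that has no matching threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy baseline: two gold standard nodes scored 1.0, two other nodes scored 0.0
y_true = np.array([1, 1, 0, 0])
y_scores = np.array([1.0, 1.0, 0.0, 0.0])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Drop the final (precision=1, recall=0) point so that every row
# corresponds to a real threshold
baseline_points = {
    "Threshold": thresholds,
    "Precision": precision[:-1],
    "Recall": recall[:-1],
}
```

The exact number of thresholds returned can differ between scikit-learn versions, so the sketch relies only on the length relationship and the final point.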
def get_interesting_input_nodes(self) -> pd.DataFrame:
    """
    Returns: a table listing the input nodes considered as starting points for pathway reconstruction algorithms,
    restricted to nodes that have at least one of the specified attributes.
-    restricted to nodes that have at least one of the specified attributes.
+    or all of the nodes in this dataset that have at least one 'interesting' attribute as specified in-code.
@param node_ensembles: dict of the pre-computed node_ensemble(s)
@param node_table: gold standard nodes
@param input_nodes: the input nodes (sources, targets, prizes, actives) used for a specific dataset
- @param input_nodes: the input nodes (sources, targets, prizes, actives) used for a specific dataset
+ @param input_nodes: the input nodes (usually from `Dataset#get_interesting_input_nodes`) used for a specific dataset

Working on #370