Extending cuik-molmaker to reactions (cuik-reactmaker) by akshatzalte · Pull Request #4 · NVIDIA-BioNeMo/cuik-molmaker

akshatzalte · 2026-05-22T16:48:17Z

Add `batch_reaction_featurizer` for CGR reaction featurization

What this PR adds

A new batch_reaction_featurizer function — the reaction analogue of batch_mol_featurizer — using the Condensed Graph of Reaction (CGR) representation (same as Chemprop's CondensedGraphOfReactionFeaturizer). API is consistent with the existing package:

atom_onehot = cuik_molmaker.atom_onehot_feature_names_to_array([...])
atom_float  = cuik_molmaker.atom_float_feature_names_to_array(['aromaticity', 'mass'])
bond_feats  = cuik_molmaker.bond_feature_names_to_array([...])
mode_int    = cuik_molmaker.reaction_mode_names_to_array(['REAC_DIFF'])[0]

V, E, edge_index, rev_edge_index, batch = cuik_molmaker.batch_reaction_featurizer(
    reac_smiles_list, prod_smiles_list,
    atom_onehot, atom_float, bond_feats,
    keep_h=True, add_h=False, offset_carbon=False, mode=mode_int,
)
# Returns 5 NumPy arrays — same convention as batch_mol_featurizer

Supported: all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR) and all 6 reaction modes (REAC_DIFF, REAC_PROD, PROD_DIFF, and their _BALANCE variants). keep_h/add_h semantics match Chemprop's make_mol exactly.

Implementation notes

All new code is in src/reaction_features.cpp (701 lines); additions to features.h, one_hot.cpp, and cuik_molmaker_cpp.cpp. No existing code was modified.

Two design choices worth noting:

parse_rxn_side_mol does not clear atom-map numbers. The existing parse_mol strips them for reordering purposes; reaction featurization needs them for reactant↔product correspondence. Fully additive — molecule featurization is unchanged.
Bond lookup is O(bonds) via hash map (unordered_map<uint64_t, size_t> keyed by (min_idx << 32) | max_idx), replacing the Python CGR featurizer's O(n²) atom-pair scan.

Node ordering matches Chemprop exactly: reactant atoms 0..n_reac−1, then product-only atoms n_reac..n_cgr−1.

Correctness

Verified against Chemprop's Python CondensedGraphOfReactionFeaturizer on:

E2 dataset (1264 reactions, atom-mapped explicit H)
SN2 dataset (2362 reactions, includes [H-:2] lone-hydride nucleophiles)
10 hand-crafted unbalanced reactions (num_only vs BALANCE divergence)

A test fixture CSV (tests/data/sample_rxns_100.csv, 110 reactions) is included. Golden .xz reference files can be committed once you've had a look at the design.

Speedup benchmarks present here

Implements C++ CGR (Condensed Graph of Reaction) featurization to accelerate chemprop reaction property prediction (~9x over Python path). New C++ code: - src/reaction_features.cpp: batch_reaction_featurizer supporting all 6 RxnModes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) and all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR). Uses O(bonds) hash-map bond enumeration instead of O(n^2) atom-pair scan. - src/features.h: ReactionMode enum, reaction_mode_names_to_array, and parse_reaction declarations; get_atomic_num_onehot_index helper for num_only representation of unmatched atoms - src/one_hot.cpp/h: get_atomic_num_onehot_index for num_only encoding - src/cuik_molmaker_cpp.cpp: exports reaction_mode_names_to_array and batch_reaction_featurizer to Python - CMakeLists.txt: add reaction_features.cpp to cuik_molmaker_core sources Test fixtures: - tests/data/sample_rxns_100.csv: 100 balanced reactions (50 E2 + 50 SN2) plus 10 hand-crafted unbalanced reactions covering num_only and BALANCE mode edge cases. Verified against chemprop CondensedGraphOfReactionFeaturizer across all 6 modes with max_diff=0.

test_reaction_features.py: parametrized over all 4 atom featurizer versions (V1, V2, ORGANIC, RIGR) × all 6 reaction modes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) = 24 test cases. RIGR uses reduced bond features (["is-null", "in-ring"], bond_fdim=2 per side = 4 total) unlike V1/V2/ORGANIC which use 5 bond features (bond_fdim=14 per side = 28 total). Golden .xz files generated from C++ batch_reaction_featurizer output after verifying agreement with chemprop CondensedGraphOfReactionFeaturizer (max_diff=0 on E2/SN2 data across all 6 modes).

akshatzalte · 2026-06-14T21:26:57Z

Hi @sveccham , just a friendly ping on the PR whenever you have a chance. I'd appreciate any feedback/comments before it is ready to merge.

sveccham · 2026-06-15T19:45:20Z

Hi @akshatzalte,

Thank you for your patience. I will have a look at this PR in the next few days. We are currently working on releasing v0.2.1.

sveccham

Hi @akshatzalte,
Thanks again for the PR. Very neat idea using a hashmap to reduce the problem complexity. This is also a very good use case for moving to C++.

I had a high level look at the PR and left a couple of initial comments. I had one question about the overall design. The CGR bond lookup is constructed once for each reaction in a batch. Would it be more computationally efficient to construct it for the entire batch (with appropriate atom and bond offsets) at once and exploit the O(1) lookup?

sveccham · 2026-06-16T19:40:11Z

+                                       offset_carbon,
+                                       ReactionMode(mode_int));
+    },
+    "Featurize a batch of reactions (CGR) and return 5 NumPy arrays.");


List what the 5 numpy arrays are briefly.

sveccham · 2026-06-16T19:43:59Z

What is the source of these sample reactions? We should make sure they are references appropriately and comply with license (if any)

akshatzalte · 2026-06-26T22:09:21Z

Hi @sveccham ,
Thank you for your comments. I will resolve them over this weekend.

For the high-level design question, doing a batch level map construction won't be super advantageous. My current understanding is that merging into a single batch-wide table wouldn't change the number of lookups: since bonds only exist within a single reaction, the queries are all intra-reaction and their count is set by the pair scan rather than by how the table is partitioned. So a batch table would be queried the same number of times. It would be just larger and colder in cache, with some added offset bookkeeping.

If that's right, the bigger lever for the bond step might be the iteration strategy rather than the table's scope: replacing the O(atoms^2) pair scan with direct bond iteration (O(atoms + bonds)), still reproducing chemprop's exact edge ordering. I'd be very happy to prototype that and benchmark it if you think it's worth it. This would make the code look significantly different from the current python-only handling in chemprop and the real advantage will be visible for super large molecules only. Open to your thoughts on this.

akshatzalte added 3 commits May 22, 2026 12:40

.gitignore: ignore build output and local scratch files

64e8070

sveccham reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Extending cuik-molmaker to reactions (cuik-reactmaker)#4

Extending cuik-molmaker to reactions (cuik-reactmaker)#4
akshatzalte wants to merge 3 commits into
NVIDIA-BioNeMo:mainfrom
akshatzalte:feat/cgr-reaction-featurization

akshatzalte commented May 22, 2026

Uh oh!

akshatzalte commented Jun 14, 2026

Uh oh!

sveccham commented Jun 15, 2026

Uh oh!

sveccham left a comment

Uh oh!

sveccham Jun 16, 2026

Uh oh!

sveccham Jun 16, 2026

Uh oh!

akshatzalte commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

akshatzalte commented May 22, 2026

Add batch_reaction_featurizer for CGR reaction featurization

What this PR adds

Implementation notes

Correctness

Uh oh!

akshatzalte commented Jun 14, 2026

Uh oh!

sveccham commented Jun 15, 2026

Uh oh!

sveccham left a comment

Choose a reason for hiding this comment

Uh oh!

sveccham Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

sveccham Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

akshatzalte commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add `batch_reaction_featurizer` for CGR reaction featurization