Extending cuik-molmaker to reactions (cuik-reactmaker)#4
Conversation
Implements C++ CGR (Condensed Graph of Reaction) featurization to accelerate chemprop reaction property prediction (~9x over Python path). New C++ code: - src/reaction_features.cpp: batch_reaction_featurizer supporting all 6 RxnModes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) and all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR). Uses O(bonds) hash-map bond enumeration instead of O(n^2) atom-pair scan. - src/features.h: ReactionMode enum, reaction_mode_names_to_array, and parse_reaction declarations; get_atomic_num_onehot_index helper for num_only representation of unmatched atoms - src/one_hot.cpp/h: get_atomic_num_onehot_index for num_only encoding - src/cuik_molmaker_cpp.cpp: exports reaction_mode_names_to_array and batch_reaction_featurizer to Python - CMakeLists.txt: add reaction_features.cpp to cuik_molmaker_core sources Test fixtures: - tests/data/sample_rxns_100.csv: 100 balanced reactions (50 E2 + 50 SN2) plus 10 hand-crafted unbalanced reactions covering num_only and BALANCE mode edge cases. Verified against chemprop CondensedGraphOfReactionFeaturizer across all 6 modes with max_diff=0.
test_reaction_features.py: parametrized over all 4 atom featurizer versions (V1, V2, ORGANIC, RIGR) × all 6 reaction modes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) = 24 test cases. RIGR uses reduced bond features (["is-null", "in-ring"], bond_fdim=2 per side = 4 total) unlike V1/V2/ORGANIC which use 5 bond features (bond_fdim=14 per side = 28 total). Golden .xz files generated from C++ batch_reaction_featurizer output after verifying agreement with chemprop CondensedGraphOfReactionFeaturizer (max_diff=0 on E2/SN2 data across all 6 modes).
|
Hi @sveccham , just a friendly ping on the PR whenever you have a chance. I'd appreciate any feedback/comments before it is ready to merge. |
|
Hi @akshatzalte, Thank you for your patience. I will have a look at this PR in the next few days. We are currently working on releasing |
sveccham
left a comment
There was a problem hiding this comment.
Hi @akshatzalte,
Thanks again for the PR. Very neat idea using a hashmap to reduce the problem complexity. This is also a very good use case for moving to C++.
I had a high level look at the PR and left a couple of initial comments. I had one question about the overall design. The CGR bond lookup is constructed once for each reaction in a batch. Would it be more computationally efficient to construct it for the entire batch (with appropriate atom and bond offsets) at once and exploit the O(1) lookup?
| offset_carbon, | ||
| ReactionMode(mode_int)); | ||
| }, | ||
| "Featurize a batch of reactions (CGR) and return 5 NumPy arrays."); |
There was a problem hiding this comment.
List what the 5 numpy arrays are briefly.
There was a problem hiding this comment.
What is the source of these sample reactions? We should make sure they are references appropriately and comply with license (if any)
|
Hi @sveccham , For the high-level design question, doing a batch level map construction won't be super advantageous. My current understanding is that merging into a single batch-wide table wouldn't change the number of lookups: since bonds only exist within a single reaction, the queries are all intra-reaction and their count is set by the pair scan rather than by how the table is partitioned. So a batch table would be queried the same number of times. It would be just larger and colder in cache, with some added offset bookkeeping. If that's right, the bigger lever for the bond step might be the iteration strategy rather than the table's scope: replacing the O(atoms^2) pair scan with direct bond iteration (O(atoms + bonds)), still reproducing chemprop's exact edge ordering. I'd be very happy to prototype that and benchmark it if you think it's worth it. This would make the code look significantly different from the current python-only handling in chemprop and the real advantage will be visible for super large molecules only. Open to your thoughts on this. |
Add
batch_reaction_featurizerfor CGR reaction featurizationWhat this PR adds
A new
batch_reaction_featurizerfunction — the reaction analogue ofbatch_mol_featurizer— using the Condensed Graph of Reaction (CGR) representation (same as Chemprop'sCondensedGraphOfReactionFeaturizer). API is consistent with the existing package:Supported: all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR) and all 6 reaction modes (
REAC_DIFF,REAC_PROD,PROD_DIFF, and their_BALANCEvariants).keep_h/add_hsemantics match Chemprop'smake_molexactly.Implementation notes
All new code is in
src/reaction_features.cpp(701 lines); additions tofeatures.h,one_hot.cpp, andcuik_molmaker_cpp.cpp. No existing code was modified.Two design choices worth noting:
parse_rxn_side_moldoes not clear atom-map numbers. The existingparse_molstrips them for reordering purposes; reaction featurization needs them for reactant↔product correspondence. Fully additive — molecule featurization is unchanged.unordered_map<uint64_t, size_t>keyed by(min_idx << 32) | max_idx), replacing the Python CGR featurizer's O(n²) atom-pair scan.Node ordering matches Chemprop exactly: reactant atoms 0..n_reac−1, then product-only atoms n_reac..n_cgr−1.
Correctness
Verified against Chemprop's Python
CondensedGraphOfReactionFeaturizeron:[H-:2]lone-hydride nucleophiles)num_onlyvs BALANCE divergence)A test fixture CSV (
tests/data/sample_rxns_100.csv, 110 reactions) is included. Golden.xzreference files can be committed once you've had a look at the design.Speedup benchmarks present here