Skip to content

feat(ls): least-squares distance scoring for the C++ search kernel#246

Merged
ms609 merged 3 commits into
cpp-searchfrom
claude/pensive-chandrasekhar-f13d89
Jun 1, 2026
Merged

feat(ls): least-squares distance scoring for the C++ search kernel#246
ms609 merged 3 commits into
cpp-searchfrom
claude/pensive-chandrasekhar-f13d89

Conversation

@ms609
Copy link
Copy Markdown
Owner

@ms609 ms609 commented May 31, 2026

What

Adds a least-squares (LS) distance-fitting capability to the optimised C++ search kernel, so the existing NNI/SPR rearrangement engine can find the tree that best fits a target distance matrix under a least-squares criterion — not just minimise a parsimony score.

Two new exported R functions:

  • LeastSquaresTree(dist, tree = NULL, method, weight, …) — searches topologies for the LS-best fit to dist. Starts from the neighbour-joining tree (default), a single tree, or a multiPhylo (returns the best). Returns an unrooted tree with fitted edge.length and an "RSS" attribute.
  • LeastSquaresFit(tree, dist, method, weight) — fits branch lengths on a fixed topology; the native-kernel analogue of phangorn::nnls.tree().

Both support method = "nnls" (non-negative, default) or "ols", and weight = NULL / "fm" (Fitch–Margoliash 1/D²) / a custom matrix.

Why

This is the topology-search step of Lapointe & Cucumel's (1997, Syst. Biol. 46(2):306–312) average consensus procedure, which a new sibling Consensus package will call via Consensus::Average(method = "ls"). The averaged patristic matrix is generally non-additive (violates the four-point condition), so the best-fitting topology must be found heuristically — exactly what this engine does. Until this lands, Consensus falls back to a slow R path (CustomSearch + phangorn::nnls.tree); the key external requirement met here is a clean, documented entry point: "given a distance matrix and starting tree(s), return the LS-best tree with branch lengths."

How

  • src/ts_ls.{h,cpp} — design matrix over the tree's unrooted branches (splits + pendant edges, 2n−3 columns; the rooted root's two edges merged into one), OLS (normal equations + Cholesky) and NNLS (Lawson–Hanson active set) solvers, weighted RSS, and ls_nni_search / ls_spr_search that reuse the TreeState NNI/SPR move primitives on a topology-only tree (total_words == 0) — the parsimony/Fitch state arrays are never touched.
  • ts_ls_fit / ts_ls_search Rcpp entry points in src/ts_rcpp.cpp, registered in src/TreeSearch-init.c (check_init.R passes).
  • R/LeastSquares.R — the wrappers, with input coercion (dist/matrix, label alignment, rooted-binary coercion, compressed multiPhylo) and roxygen docs.

Validation

  • 58 + 5 = 63 passing tests (tests/testthat/test-LeastSquares.R): additive matrices recover the exact generating topology with RSS ≈ 0 (from NJ, random, and multiPhylo starts); branch lengths/RSS match phangorn::nnls.tree() and OLS matches a direct designTree normal-equation solve, to machine precision; non-additive sanity; weighting; OLS/NNLS agreement; input validation; and graceful handling of rank-deficient fits.
  • Standalone validation script at dev/ls_validate.R.

Reviewer notes

  • Existing parsimony API is unchanged — the C++/Rcpp additions are purely additive (verified via a MaximizeParsimony smoke test and an additive-only RcppExports diff).
  • Each candidate topology is fully re-fitted (LS is global — a move changes many path memberships, so it isn't a cheap incremental update). Fast for the moderate tip counts of the consensus use case (n = 18 search ≈ 0.17 s).
  • Returned trees are unrooted (phangorn convention); the kernel emits the two root edges as (length, 0) so unrooting sums them correctly.
  • Singular (rank-deficient / zero-weight) fits return ok = false and warn rather than crashing — this was a heap-OOB blocker caught in an independent expert review and fixed (with a regression test) before this PR.
  • This branch is based on cpp-search; it also carries one trivial pre-existing commit ("line endings", +3 lines to .gitattributes) that is on local cpp-search but not yet on origin/cpp-search.

🤖 Generated with Claude Code

ms609 and others added 3 commits May 31, 2026 07:14
Add a least-squares (LS) tree-fitting capability to the optimised C++
search kernel, so the existing NNI/SPR rearrangement engine can find the
tree that best fits a target distance matrix under a least-squares
criterion -- not just minimise parsimony. Built for Lapointe & Cucumel's
(1997) average consensus procedure (a new sibling `Consensus` package),
where an averaged, generally non-additive patristic distance matrix must
be fit by a heuristic topology search.

- src/ts_ls.{h,cpp}: design matrix over the tree's unrooted branches
  (splits + pendant edges, 2n-3 columns; root's two edges merged), OLS
  (normal equations + Cholesky) and NNLS (Lawson-Hanson) solvers, RSS,
  and NNI/SPR LS hill-climbing reusing the TreeState move primitives on a
  topology-only tree (never touches the Fitch state arrays).
- ts_ls_fit / ts_ls_search Rcpp entry points (src/ts_rcpp.cpp, registered
  in TreeSearch-init.c).
- R/LeastSquares.R: LeastSquaresFit() (fixed topology) and
  LeastSquaresTree() (search from NULL/NJ, one tree, or a multiPhylo),
  returning unrooted trees with fitted branch lengths and an "RSS" attr.
  Supports OLS/NNLS and unit/Fitch-Margoliash/custom weighting.

Validated against phangorn::nnls.tree() and designTree() to machine
precision (tests + dev/ls_validate.R). The existing parsimony API is
unchanged. Singular (rank-deficient / zero-weight) fits fail gracefully
with a warning rather than crashing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author names introduced in the LeastSquares documentation and bib entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ms609 ms609 merged commit dcdd401 into cpp-search Jun 1, 2026
9 of 10 checks passed
@ms609 ms609 deleted the claude/pensive-chandrasekhar-f13d89 branch June 1, 2026 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant