Skip to content

fix: extract Express.js anonymous route handler callbacks#49

Merged
ar7casper merged 4 commits into
knostic:release/2026-05-12from
joshbouncesecurity:fix/issue21-express-handlers
May 13, 2026
Merged

fix: extract Express.js anonymous route handler callbacks#49
ar7casper merged 4 commits into
knostic:release/2026-05-12from
joshbouncesecurity:fix/issue21-express-handlers

Conversation

@joshbouncesecurity
Copy link
Copy Markdown
Contributor

Summary

The JS/TS parser pipeline did not extract anonymous arrow function callbacks used as Express route handlers, so most application code in a typical Express app was invisible to analysis.

This PR adds detection in typescript_analyzer.js:

  • Recognises Express-style call sites (<obj>.<verb>(<path>, ...callbacks) with verb in get|post|put|patch|delete|options|head|all|use).
  • Filters the receiver to plausibly-Express identifiers (app, router, routes, server, plus chained .route(...)/.Router() calls) so generic .get(...) calls on caches/clients/query-builders aren't misread as routes.
  • Extracts anonymous arrow / function expressions in callback positions as units, marking the last as route_handler (with is_entry_point: true) and earlier callbacks as route_middleware.
  • Adds metadata: http_method, http_path, callback_index, named_middleware.
  • Emits call-graph edges from each anonymous callback to named middleware identifiers in the same call (e.g. authenticateToken); dependency_resolver.js merges these "explicit" edges with the body-text regex edges so reachability/upstream-deps see the relationship.

unit_generator.js surfaces the new fields on each unit (route, is_entry_point, http_method, http_path, callback_index). Existing route_handlers detected by the per-function classifier still flow through unchanged.

Addresses #21 (does not close — verify against the user's reported repo first).

Test plan

  • tests/parsers/javascript/test_express_route_handlers.py covers:
    • anonymous handler with named middleware: extraction, is_entry_point=true, http_method/http_path, call-graph edge to authenticateToken
    • handler with no middleware: clean unit with no extra edges
    • router.use('/api', anonMw1, anonMw2, anonHandler): one route_handler + two route_middleware
    • non-Express myCache.get('foo', () => {}) and queryBuilder.post(...): nothing extracted
    • named-only router.get('/x', namedHandler): no anonymous unit synthesised; named function still extracted by the regular path
  • Existing JS parser tests (tests/test_js_parser.py) still pass.
  • Full suite: pytest tests/ -> 96 passed, 12 skipped, 3 xfailed (preexisting Windows-only xfails).
  • Manual: parse a small Express app — anonymous route handler bodies appear as units in dataset.json, is_entry_point: true, route populated.

joshbouncesecurity and others added 3 commits May 4, 2026 21:42
The JS/TS parser pipeline did not extract anonymous arrow function
callbacks used as Express route handlers, so most application code in a
typical Express app was invisible to analysis.

This change adds detection in typescript_analyzer.js:

- Recognises Express-style call sites (`<obj>.<verb>(<path>, ...callbacks)`
  with verb in `get|post|put|patch|delete|options|head|all|use`).
- Filters the receiver to plausibly-Express identifiers (`app`, `router`,
  `routes`, `server` and chained `.route(...)`/`.Router()`) so generic
  `.get(...)` calls on caches/clients aren't misread as routes.
- Extracts anonymous arrow / function expressions in callback positions
  as units, marking the last as `route_handler` (with
  `is_entry_point: true`) and earlier callbacks as `route_middleware`.
- Adds metadata: `http_method`, `http_path`, `callback_index`,
  `named_middleware`.
- Records explicit call-graph edges from each anonymous callback to
  named middleware identifiers in the same call (e.g.
  `authenticateToken`); dependency_resolver.js merges these with the
  body-text regex edges so reachability/upstream-deps work.
- unit_generator.js surfaces the metadata on the unit (`route`,
  `is_entry_point`, `http_method`, `http_path`, `callback_index`).

Addresses #21.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The Express anonymous route handler extraction added synthetic entries
to `data["functions"]` but not to `data["callGraph"]`, breaking the
existing invariant `len(callGraph) == len(functions)` exercised by
`test_js_parser.TestTypeScriptAnalyzer::test_builds_call_graph`.

Emit a callGraph entry for each synthesised route_handler /
route_middleware unit, capturing inline call expressions from the
callback body. Named middleware identifiers continue to flow through
`explicitCalls` and are merged downstream by `dependency_resolver.js`.

Add a regression test asserting the callGraph/functions invariant for
the synthetic units.
Adds four regression tests for `_extractExpressRouteCallbacks`:

- TypeScript-annotated callbacks `(req: Request, res: Response) => {...}`
  parse correctly and produce the expected synthetic handler unit.
- `app.get('/' + prefix, handler)` (dynamic path) is skipped without
  throwing — confirms the StringLiteral check is a hard gate.
- `app.use((req, res, next) => {...})` with no path produces a single
  unit with http_path=null and http_method='USE'.
- `app.get(path, anonMw, namedHandler)` (anon middleware before named
  handler): the anon callback gets a route_middleware unit with
  named_middleware=['namedHandler'], and the named handler is left to
  the regular extractor.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Manual verification

Sample Express file:

const router = express.Router();
router.post('/orders', authenticateToken, async (req, res) => {
  const { productId } = req.body;
  // ...
});
  • openant parse <repo>: dataset.json contains a unit for the anonymous POST /orders handler with is_entry_point: true, http_method: "POST", http_path: "/orders".
  • Call graph: edge from the synthesized handler to authenticateToken.
  • app.get('/users', (req, res) => res.json([])): extracted with no extra middleware edges.
  • router.use('/api', limiter, auth, async (req, res, next) => {...}) (multi-middleware, all anonymous): one route_handler unit + two route_middleware units.
  • False-positive guard: myCache.get('foo', () => {}) and queryBuilder.post('users', () => {}): NOT extracted (correctly skipped — non-Express receivers).
  • Named handler: router.get('/x', namedHandler): no anonymous unit synthesized; namedHandler extracted by the existing path.
  • TS-typed callback: (req: Request, res: Response): Promise<void> => {...}: extracted correctly with metadata.
  • Dynamic path: app.get('/' + prefix, ...) (path is not a string literal): correctly skipped, no crash.
  • On the issue's reported repo (Express app with 7 vulnerabilities): unit count grows; reachability_filter retains the route handlers.

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Local test results

Built a tiny inline Express fixture (1 file, 13 lines) and ran the JS pipeline (typescript_analyzer + unit_generator) from this branch.

Fixture (orders.js):

const express = require('express');
const router = express.Router();

function authenticateToken(req, res, next) { return next(); }

router.post('/orders', authenticateToken, async (req, res) => {
  const { productId } = req.body;
  res.json({ ok: true, productId });
});

module.exports = router;

Commands run:

node typescript_analyzer.js <repo> --files-from list.txt --output analyzer_output.json
node unit_generator.js analyzer_output.json --output dataset.json

Outcome (per checklist item):

  • Anonymous POST /orders handler appears as a unit (orders.js:express(POST:/orders:8:1)) ✅
  • is_entry_point: true on the synthesized handler ✅
  • HTTP method and path are present — but on the unit's route field as route.method = 'POST' / route.path = '/orders' (the top-level http_method / http_path / callback_index shown in the checklist are not populated on the unit; they may be intentionally consumed only at intermediate stages or there may be a small surfacing gap in unit_generator.js) ⚠️
  • Call-graph edge from the anonymous handler to authenticateToken is present in metadata.direct_calls = ['orders.js:authenticateToken']; the named middleware also lists the handler as a caller in its direct_callers
  • route.middleware = ['authenticateToken'] correctly captures the named middleware ✅
  • Did not separately exercise the multi-middleware, false-positive guard (myCache.get, queryBuilder.post), TS-typed callback, dynamic-path skip, or the originally-reported repo on this run — those are covered by tests/parsers/javascript/test_express_route_handlers.py per the diff.

One small thing worth a look: the manual-verification checklist mentions top-level http_method / http_path / callback_index fields, but in dataset.json only route.method / route.path are populated. Either the checklist text or unit_generator.js's field-surfacing might want a small alignment.

Copy link
Copy Markdown
Collaborator

@ar7casper ar7casper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid PR overall — the receiver filter, named-middleware-as-call-graph-edge wiring, and the callGraph invariant test are all real improvements over a naive "extract anything that looks like a callback" approach. One blocker, two medium nits below.

Blocker: anonymous middleware bodies are silently dropped from analysis.

The new route_middleware unitType isn't in ENTRY_POINT_TYPES in utilities/agentic_enhancer/entry_point_detector.py:26-31 (currently {route_handler, view_function, websocket_handler, cli_handler}), and the synth units have isEntryPoint: false. The chain:

  1. Analyzer creates route_middleware units ✅
  2. Entry point detector doesn't flag them (unitType not in ENTRY_POINT_TYPES, isEntryPoint false)
  3. Reachability filter has no entry point to reach them from (Express invokes middleware via the framework, not via a code-resolvable call edge)
  4. Filtered out before the LLM sees them.

Net effect: a route like

app.post('/x',
  async (req, res, next) => { /* anonymous middleware doing dangerous thing */ next(); },
  async (req, res) => { /* the handler */ }
);

gets a unit for the handler ✅ but the middleware body is silently dropped — even though it sees req directly and could be doing absolutely anything dangerous. That's the inverse of what #21 asks for.

One-line fix: add 'route_middleware' to ENTRY_POINT_TYPES. Or set isEntryPoint: true for all callbacks (matches the simpler-but-correct approach in #48). I'd suggest the former since you've already done the structural work to distinguish handler vs middleware — making both entry points preserves that distinction without losing analysis coverage. Worth adding a regression test that asserts the middleware unit survives the reachability filter end-to-end.

Two inline nits below (medium severity, not blockers).

Also worth flagging for follow-up (non-blocking):

  • route_middleware is a new unit_type the LLM enhancer prompt has no specific guidance for. The schema accepts arbitrary strings so no crash, but the LLM doesn't have a mental model for "this is middleware vs the handler." Probably fine, just noting.
  • You flagged the spurious self-edge (UserController.get → UserController.get) on #39 a couple days ago — extractCallsFromFunction(arg, relativePath) is called per-handler here too, same code path, same risk. Worth checking if the same pattern shows up in this PR's output.

* `.get(...)` calls on caches / clients / query-builders aren't misread
* as routes.
*/
_isPlausibleExpressReceiver(receiver) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The receiver filter has known false negatives for common naming conventions: codebases that use single-word identifiers like web, api, endpoints, controller for the Express app/router get no extraction at all, since those names don't end in app/router/routes/server. From a quick scan of popular Express apps in the wild, web and api are not unusual choices.

Suggestions (non-blocking):

  • Either expand the suffix list (add endpoints, controller, web, api — or treat any short identifier as a candidate and rely on the verb-set + string-path filter to reject false positives), or
  • Document the assumption in the JSDoc and the route_handler extraction docstring so reviewers / users understand why their app might silently produce zero route units.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f869c97. Expanded the stem list to include web, api, endpoints, and controller, and consolidated both the Identifier and PropertyAccessExpression branches to share a single static EXPRESS_RECEIVER_STEMS field so they can't drift out of sync. Updated the JSDoc to document the remaining coverage boundary (identifiers outside the stem list will produce zero units).

} else if (k === "PropertyAccessExpression") {
// e.g. middleware.auth — keep the trailing name
const name = arg.getName ? arg.getName() : arg.getText();
namedMiddleware.push(name);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PropertyAccessExpression args (app.post('/x', middleware.auth, ...)), this stores just the trailing name ("auth"), which _resolveCall will then look up in the global function map. If there's any other function called auth anywhere in the codebase (e.g. someClass.auth), _resolveCall may resolve to the wrong one — silent false-positive call graph edge.

Suggestions (non-blocking):

  • Store the full arg.getText() (e.g. "middleware.auth") and let dependency_resolver's existing _resolveMethodCall handle the property-access pattern.
  • Or, if you want to keep the simple-name semantics, document the limitation and add a test exercising the ambiguous-name case so the behavior is intentional.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented with a comment explaining the known limitation — trailing-name-only storage (e.g. "auth" from "middleware.auth") can resolve to the wrong target if another function shares the same simple name. Chose documentation over a dual-file refactor (also touching dependency_resolver.js) as the simpler, lower-risk path for a non-blocking nit.

…filter

Blocker: add 'route_middleware' to ENTRY_POINT_TYPES so anonymous Express
middleware bodies are not silently dropped by the reachability filter. Add
regression tests for EntryPointDetector that assert both route_handler and
route_middleware units survive as entry points.

Non-blocking nit 1: expand _isPlausibleExpressReceiver to accept common
single-word app/router identifiers (web, api, endpoints, controller) that
were previously missed, causing zero route units to be emitted for those
codebases. Consolidate the stem list into a static class field so both the
Identifier and PropertyAccessExpression branches stay in sync.

Non-blocking nit 2: document the known limitation that PropertyAccessExpression
middleware args are stored as trailing-name only (e.g. "auth" from
"middleware.auth"), which can produce false-positive call-graph edges when
another function shares the same simple name.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Actioned all review comments in f869c97 — summary of what was fixed:

Blocker: route_middleware silently dropped
Added 'route_middleware' to ENTRY_POINT_TYPES in entry_point_detector.py. Anonymous Express middleware bodies now survive the reachability filter and are passed to the LLM. Added a regression test in tests/test_entry_point_detector.py asserting both route_handler and route_middleware units are detected as entry points end-to-end.

Nit 1: receiver filter false negatives
Expanded _isPlausibleExpressReceiver to cover web, api, endpoints, and controller as accepted stems. Consolidated the stem list into a single static EXPRESS_RECEIVER_STEMS field shared by both the Identifier and PropertyAccessExpression branches so they stay in sync. Updated JSDoc to document what's still out of scope.

Nit 2: PropertyAccessExpression trailing-name ambiguity
Documented the known limitation with an inline comment rather than the more invasive dual-file refactor. The comment explains that storing only the trailing name (e.g. "auth" from "middleware.auth") can produce a false-positive call-graph edge if another function shares the same simple name.

Self-edge concern (checked, no change)
extractCallsFromFunction when called on an ArrowFunction node calls .getDescendantsOfKind(funcNode.getKind()) — which returns nested arrow functions — then filters for CallExpression, producing an empty array. No calls are emitted from this path, so spurious self-edges cannot occur here. The implementation is effectively a placeholder; noted but out of scope for this PR.

@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 12, 2026 15:18
@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

@ar7casper responded

@joshbouncesecurity joshbouncesecurity changed the base branch from master to release/2026-05-12 May 12, 2026 16:53
Copy link
Copy Markdown
Collaborator

@ar7casper ar7casper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round-2 review — thanks for actioning all 3 round-1 findings cleanly in f869c97.

Verified each fix:

Round-1 Status
🟠 Blocker — route_middleware not in ENTRY_POINT_TYPES ✅ Added at entry_point_detector.py:28 + new regression test in test_entry_point_detector.py
🟡 Medium — Receiver filter false negatives ✅ Added web/api/endpoints/controller to EXPRESS_RECEIVER_STEMS static field (consolidated for both Identifier and PropertyAccess branches)
🟡 Medium — Named-middleware trailing-name ⚖️ Documented as known limitation at typescript_analyzer.js:406-410. Reasonable call — fix scope was non-trivial. Comment names the failure mode for future readers.

CI all green on Ubuntu/macOS/Windows. Solid PR.

One Medium worth raising for awareness (not blocking — it's pre-existing) and two housekeeping Lows:


🟡 Medium — JS pipeline doesn't normalize unitTypeunit_type (pre-existing, not introduced by this PR)

The new regression test (test_route_middleware_is_entry_point) passes in isolation because it constructs func_data with unit_type (snake_case) directly. But in the JS production pipeline the chain is:

  1. typescript_analyzer.js:446 emits unitType: "route_middleware" (camelCase)
  2. parsers/javascript/test_pipeline.py:497 reads functions = analyzer.get("functions", {})no normalization
  3. entry_point_detector.py:176 reads func_data.get('unit_type', '') (snake_case) — misses the camelCase key
  4. unit_type Check 1 silently returns no match for ALL JS units
  5. Only Check 3 (input patterns: req.body, req.params, etc. in code body) actually fires

For typical Express middleware that touch req.body/req.headers, Check 3 catches them — practical effect is fine. But for minimal middleware like async (req, res, next) => next(); (rare but possible), Check 3 doesn't match and the unit gets dropped despite this PR's ENTRY_POINT_TYPES fix.

This is the documented bug per parser-issues #23-26 + Room-for-Improvement #10 — predates this PR. Not blocking for merge, but worth noting because the regression test gives somewhat-false confidence: the structural piece is correct, but the production effect for JS depends on the broader normalization work.

Suggestions (any one, all non-blocking):

  • Most thorough: fix the upstream parser-issues #23-26 in a separate PR — normalize JS analyzer output to use snake_case before passing to EntryPointDetector.
  • PR-scoped: add an end-to-end test that runs the full JS pipeline (typescript_analyzer.jsunit_generator.jsEntryPointDetector reading the actual analyzer output) and asserts a route_middleware unit with no req.X body references survives the reachability filter. Would surface the gap concretely.
  • Minimum: acknowledge in PR body that the structural fix is complete but practical effect for JS middleware depends on the broader unit_type normalization work.

🟢 Low — PR body stale

Body still says Filters the receiver to plausibly-Express identifiers ("app", "router", "routes", "server", plus chained ".route(...)"/".Router()" calls) — but the actual stems list is now app|router|routes|server|web|api|endpoints|controller. Worth updating to list all 8 stems.


🟢 Low — New receiver patterns lack test coverage

The new stems (web, api, endpoints, controller) work as designed, but no test exercises them. Existing tests use router and app. A future refactor that breaks the regex for one of the new patterns wouldn't be caught.

Suggestion: parameterize the existing happy-path test, or add 1-2 new fixtures with web.get(...) and api.post(...) to lock in the expanded behavior.


Verdict: ✅ ready to merge. Approving via UI.

@ar7casper ar7casper merged commit 6223d86 into knostic:release/2026-05-12 May 13, 2026
9 checks passed
@ar7casper ar7casper mentioned this pull request May 13, 2026
3 tasks
ar7casper added a commit that referenced this pull request May 13, 2026
Covers the seven PRs in this release:
- #35: parse --level default → reachable (CLI consistency with scan + Python CLI)
- #36: auto-detect dep changes via ~/.openant/venv/.deps-hash
- #37: lazy JS parser npm bootstrap on first use
- #39: TypeScript/NestJS DI-aware call resolution (constructor + field + functional inject())
- #40: --language auto opt-in for openant init + non-git path support + shared config/languages.json
- #49: Express anonymous route handler extraction (route_handler / route_middleware)
- #50: --llm-reachability opt-in stage + cross-parser call_graph.json contract

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants