
fix: treat Scrypt WS Connection closed as transient #3655

Open: TaprootFreak wants to merge 2 commits into develop from fix/scrypt-ws-connection-resilience


@TaprootFreak
Collaborator

Summary

Fixes liquidity-management pipelines being permanently FAILED on transient Scrypt WebSocket disconnects.

When the Scrypt WS drops, all pending requests are rejected with new Error('Connection closed'). ScryptAdapter.checkTradeCompletion surfaced this as OrderFailedException, which marked the order Failed → action 233 has no onFail → pipeline FAILED → rule auto-paused → mail sent. Meanwhile the underlying order on Scrypt is unaffected and no funds have moved.
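For readers unfamiliar with the failure mode, here is a minimal sketch of how a WS client can end up rejecting every in-flight request with the same error on disconnect. The class `PendingRequests` and its methods are hypothetical names for illustration and are not taken from scrypt-websocket-connection.ts; only the error message 'Connection closed' comes from this PR.

```typescript
// Hypothetical registry of in-flight requests, keyed by request id (reqid).
type PendingEntry = {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
};

class PendingRequests {
  private readonly pending = new Map<number, PendingEntry>();
  private nextReqId = 1;

  // Register a request and return its id plus the promise the caller awaits.
  register(): { reqId: number; promise: Promise<unknown> } {
    const reqId = this.nextReqId++;
    const promise = new Promise<unknown>((resolve, reject) => {
      this.pending.set(reqId, { resolve, reject });
    });
    return { reqId, promise };
  }

  // Meant to be called from the socket's close handler: every waiting caller
  // receives the same generic error, whatever the state of its order.
  rejectAll(): number {
    const count = this.pending.size;
    this.pending.forEach(({ reject }) => reject(new Error('Connection closed')));
    this.pending.clear();
    return count;
  }
}
```

Because the rejection carries no information about the order itself, a caller that treats any error as fatal cannot distinguish a dropped socket from a rejected trade.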

Changes

  • scrypt-websocket-connection.ts: Extract the retry logic added in #3594 (Various improvements) into a private helper retryOnTransientWsError and apply it to fetch (previously it covered only fetchAll). fetch is used by fetchExecutionReports and fetchOrderBook, both on the hot path of checkTrade.
  • scrypt.adapter.ts: In checkTradeCompletion, classify Connection closed / unknown reqid as transient → return false so the order stays IN_PROGRESS and is retried on the next 10s cron tick. Genuine errors still throw OrderFailedException.
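
The first change can be pictured roughly as follows. This is a sketch under assumptions, not the code in this PR: the parameter list, the retry count, the delay, and the `looksTransient` predicate are hypothetical; only the helper name retryOnTransientWsError and the two error markers come from the description above.

```typescript
// Assumed predicate: the PR treats 'Connection closed' and 'unknown reqid' as transient.
function looksTransient(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return /connection closed|unknown reqid/i.test(message);
}

// Hypothetical signature: run `operation` and retry it up to `maxRetries`
// more times when it fails with a transient WS error.
async function retryOnTransientWsError<T>(
  label: string,
  operation: () => Promise<T>,
  maxRetries = 2,
  delayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Genuine errors, or transient ones after the last retry, propagate unchanged.
      if (attempt >= maxRetries || !looksTransient(error)) throw error;
      console.warn(`Retrying ${label} after transient error (attempt ${attempt + 1}/${maxRetries})`);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With a wrapper of this shape, fetch and fetchAll can share one code path by passing their request as `operation`, which is the point of extracting the helper.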

Repro / data point

  • Pipeline 60738 (2026-04-30 07:21 UTC, rule 313 Scrypt/EUR redundancy)
  • Order 122805 (sell 70'003.98 EUR → USDT): 5 ClOrdIds in 3 min (4 edits), final WS drop → Connection closed → wrongly marked Failed
  • Balance audit on Scrypt: EUR was never spent (next pipeline 60741 sold the same EUR + 40k more)
  • 20+ similar incidents since 2026-03-24, all matching this pattern

Test plan

  • CI green
  • After deploy: monitor next Scrypt redundancy pipeline; if WS drops mid-check, verify order stays IN_PROGRESS and resumes (vs. flipping to Failed)
  • Verify no double-execution: getOrderStatus cache + 30-day fallback in scrypt.service.ts:301 already dedupes by ClOrdID, so retry-on-next-tick reuses the existing correlation
  • Optional: tail logs for Retrying fetch ... after transient error and Transient WS error checking order to gauge frequency

The Scrypt WebSocket adapter rejects all pending requests with
'Connection closed' when the WS disconnects. Previously this surfaced as
a permanent OrderFailedException in the liquidity management pipeline,
causing the rule to be paused even though the underlying order on Scrypt
was still alive (no fill, no money moved).

Two changes:

- ScryptWebSocketConnection: extend the 'fetchAll' retry pattern to also
  cover 'fetch'. Refactor the retry logic into a shared helper so any
  future fetch-style call gets the same treatment.

- ScryptAdapter.checkTradeCompletion: when the underlying error is a
  transient WS error (Connection closed / unknown reqid), return false
  instead of throwing OrderFailedException, so the order stays
  IN_PROGRESS and is retried on the next cron tick.

Reproduced via pipeline 60738 (rule 313, Scrypt/EUR redundancy): order
122805 went through 5 ClOrdIds in 3 minutes before WS dropped during a
check, was wrongly marked Failed; balance audit confirmed the EUR were
never spent.
@TaprootFreak marked this pull request as ready for review on April 30, 2026, 10:37
…pter

Centralize the transient WS error markers ('Connection closed' /
'unknown reqid') and a shared isTransientWsError helper in
scrypt-websocket-connection. Both retryOnTransientWsError and the
ScryptAdapter check now use the same function, eliminating the
duplicated string list and aligning the case-insensitive matching with
isBalanceTooLowError elsewhere in the adapter.
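
A rough sketch of what the shared predicate and its use in the adapter could look like. Everything except the names isTransientWsError, checkTradeCompletion, OrderFailedException and the two marker strings is an assumption: the constant name, the `fetchIsFilled` parameter, and the simplified function signature are hypothetical stand-ins, and the real adapter method takes different arguments.

```typescript
// Markers named in this PR, stored lowercase so matching is case-insensitive.
// The constant name is hypothetical.
const TRANSIENT_WS_ERROR_MARKERS = ['connection closed', 'unknown reqid'];

// Shared predicate: true when the error message contains any transient marker.
function isTransientWsError(error: unknown): boolean {
  const message = (error instanceof Error ? error.message : String(error)).toLowerCase();
  return TRANSIENT_WS_ERROR_MARKERS.some((marker) => message.includes(marker));
}

// Stand-in for the project's exception type.
class OrderFailedException extends Error {}

// Simplified stand-in for ScryptAdapter.checkTradeCompletion;
// `fetchIsFilled` is a hypothetical callback that queries the order state.
async function checkTradeCompletion(fetchIsFilled: () => Promise<boolean>): Promise<boolean> {
  try {
    return await fetchIsFilled();
  } catch (error) {
    if (isTransientWsError(error)) {
      // Not complete yet: the order stays IN_PROGRESS and the next cron tick retries.
      console.warn('Transient WS error checking order, will retry on next tick');
      return false;
    }
    // Genuine errors still fail the order.
    throw new OrderFailedException(error instanceof Error ? error.message : String(error));
  }
}
```

Keeping the marker list in one place means the retry helper and the adapter cannot drift apart on what counts as transient.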