Skip to content

Commit c53bfe6

Browse files
BoykoNeovclaude
andcommitted
Re-arm shell ZMQStream read after out-of-band reply send
On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle and never replies, and the client times out waiting for execute_reply (~30% of headless notebook runs in our measurements; which cell hangs wanders run to run). Root cause: the shell ROUTER socket is dual-use on the shell-channel thread. A ZMQStream reads execute_requests off it, while replies are sent back over the SAME socket out-of-band via a raw send_multipart in SubshellManager._send_on_shell_channel. That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary: after zmq_send the socket may become readable without a new edge). The send is not ZMQStream-mediated, so the stream is never re-armed and a request that arrived concurrently strands unread on a registered-but- non-readable fd. The strand is terminal: no later arrival re-edges it. Fix: after each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop -- the same edge-trap reschedule ZMQStream._update_handler already runs internally (add_callback(lambda: stream._handle_events(stream.socket, 0))) -- so the concurrently-arrived request cannot strand. The shell_stream (built in kernelapp.init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it. Validated on Windows (Python 3.13/3.14, pyzmq 27.1.0 / libzmq 4.3.5): the wedge went from 6/20 (control) to 0/20 with this patch applied, same machine/session, P(0/20 | p=0.30) ~ 8e-4, with the threaded reference live on every send (551 re-arms, 0 None/mismatch). A sham arm with the same scheduling overhead but no re-arm stayed at the control rate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 821f6c0 commit c53bfe6

3 files changed

Lines changed: 21 additions & 0 deletions

File tree

ipykernel/kernelapp.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -608,6 +608,9 @@ def init_kernel(self):
608608
"""Create the Kernel object itself"""
609609
if self.shell_channel_thread:
610610
shell_stream = ZMQStream(self.shell_socket, self.shell_channel_thread.io_loop)
611+
# Hand the stream to the shell-channel thread so SubshellManager can re-arm
612+
# the read after each out-of-band reply send (the wedge fix).
613+
self.shell_channel_thread.shell_stream = shell_stream
611614
else:
612615
shell_stream = ZMQStream(self.shell_socket)
613616
control_stream = ZMQStream(self.control_socket, self.control_thread.io_loop)

ipykernel/shellchannel.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,10 @@ def __init__(
2929
self._manager: SubshellManager | None = None
3030
self._zmq_context = context # Avoid use of self._context
3131
self._shell_socket = shell_socket
32+
# Set by kernelapp.init_kernel after it builds the shell ZMQStream (this thread
33+
# is created before the stream). Threaded into SubshellManager so it can re-arm
34+
# the read after each out-of-band reply send (the wedge fix).
35+
self.shell_stream = None
3236
# Record the parent thread - the thread that started the app (usually the main thread)
3337
self.parent_thread = current_thread()
3438

@@ -43,6 +47,7 @@ def manager(self) -> SubshellManager:
4347
self._zmq_context,
4448
self.io_loop,
4549
self._shell_socket,
50+
self.shell_stream,
4651
)
4752
return self._manager
4853

ipykernel/subshell_manager.py

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,13 +40,18 @@ def __init__(
4040
context: zmq.Context[t.Any],
4141
shell_channel_io_loop: IOLoop,
4242
shell_socket: zmq.Socket[t.Any],
43+
shell_stream=None,
4344
):
4445
"""Initialize the subshell manager."""
4546
self._parent_thread = current_thread()
4647

4748
self._context: zmq.Context[t.Any] = context
4849
self._shell_channel_io_loop = shell_channel_io_loop
4950
self._shell_socket = shell_socket
51+
# ZMQStream that reads `shell_socket`. Replies are sent on this same socket
52+
# out-of-band below, draining its edge-triggered ZMQ_FD read edge; re-arm the
53+
# stream after each send so a concurrently-arrived request cannot strand unread.
54+
self._shell_stream = shell_stream
5055
self._cache: dict[str, SubshellThread] = {}
5156
self._lock_cache = Lock() # Sync lock across threads when accessing cache.
5257

@@ -226,6 +231,14 @@ def _process_control_request(
226231
def _send_on_shell_channel(self, msg) -> None:
227232
assert current_thread().name == SHELL_CHANNEL_THREAD_NAME
228233
self._shell_socket.send_multipart(msg)
234+
# Re-arm the shell ZMQStream read: this out-of-band send drained the ROUTER's
235+
# edge-triggered ZMQ_FD read edge; reschedule the stream's handler so a
236+
# concurrently-arrived request cannot strand unread (the wedge fix).
237+
stream = self._shell_stream
238+
if stream is not None and stream.socket is self._shell_socket:
239+
self._shell_channel_io_loop.add_callback(
240+
lambda: stream._handle_events(stream.socket, 0)
241+
)
229242

230243
def _stop_subshell(self, subshell_thread: SubshellThread) -> None:
231244
"""Stop a subshell thread and close all of its resources."""

0 commit comments

Comments
 (0)