Upon joining a cluster, I noticed that the actual announcement is significantly delayed. In particular, the debug output shows that a node has joined a full 20 seconds before the OnNodeJoin or OnNewLeaves events are fired.
I tracked this down to line 646 of cluster.go. My understanding, from reading through the code and the debug output, is that the join process goes like this (please slap me if I've gone awry):
- send join message to cluster
- cluster accepts join message and node into cluster (debug output says "Node ... joined!")
- cluster sends state tables back to the new node
- the new node doesn't know it has successfully joined yet, waits for 2 * NETWORK_TIMEOUT and announces presence
- after announcement is sent, the node proclaims itself joined
- the OnNodeJoin and OnNewLeaves events are fired for all nodes in the cluster
I'm guessing that there is a reason why there is a delay set to 2 * NETWORK_TIMEOUT, but I'm not sure what it is. (Truthfully, my networking skills are pretty poor, so I dare not hazard a guess.)
I would be very happy to work on a fix for this problem; I'm just not sure what the fix would look like yet, so I am seeking guidance. :-)
My inclination is to try to announce the node's presence immediately and, if it fails, try again after a longer timeout. I just don't know what "if it fails" means in this context.
Thanks!