feat: Readiness endpoint#2188
Conversation
There was a problem hiding this comment.
Something that bothers me here though, is that the load-balancer must also serve this endpoint 🤔
But I suppose that they would just forward it, and if anything is available it returns Ok, and if nothing is available they wouldn't have a service. So maybe it just works :)
|
After some digging, it looks like gRPC has its own health check service defined. We should probably consider that instead since it probably is supported by load-balancers out the box (..maybe). https://github.com/grpc/grpc/blob/master/doc/health-checking.md |
The way I understand it, its still just the same except we return an enum instead of the status code. And since each server effectively only has a single gRPC service (+ health service I guess), we only need to support the empty, health (always returns healthy), and the specific gRPC service the server implements? If we do this only for nodes then its always: |
| const POLL_INTERVAL: Duration = Duration::from_secs(30); | ||
|
|
||
| loop { | ||
| let is_ready = match source_rpc.status(Request::new(())).await { | ||
| Ok(response) => { | ||
| let upstream_tip = | ||
| response.into_inner().block_producer.map_or(u32::MAX, |b| b.chain_tip); | ||
| let local_tip = store.chain_tip(Finality::Committed).await.as_u32(); | ||
| upstream_tip.saturating_sub(local_tip) <= readiness_threshold | ||
| }, | ||
| Err(_) => false, | ||
| }; | ||
|
|
||
| if is_ready { | ||
| reporter.set_serving::<api_server::ApiServer<api::RpcService>>().await; | ||
| } else { | ||
| reporter.set_not_serving::<api_server::ApiServer<api::RpcService>>().await; | ||
| } | ||
|
|
||
| tokio::time::sleep(POLL_INTERVAL).await; |
There was a problem hiding this comment.
Just FYI, this is why I don't want to use rocksdb as the source of truth.
This would be much easier if we had a chain tip watch channel to monitor instead.
| let local_tip = store.chain_tip(Finality::Committed).await.as_u32(); | ||
| upstream_tip.saturating_sub(local_tip) <= readiness_threshold | ||
| }, | ||
| Err(_) => false, |
There was a problem hiding this comment.
I think this is a problem? This means a single request failure from upstream removes this node entirely.
There was a problem hiding this comment.
Added a threshold of const 3. LMK thoughts
| }; | ||
| self.reporter.set_service_status(Self::SERVICE_NAME, status).await; | ||
| } | ||
| } |
There was a problem hiding this comment.
@Mirko-von-Leipzig Wondering why the rpc sync code is in block producer crate. Maybe in followup I can look at moving the sync related stuff out? Unsure
Relates to #2176.
Adds support for
grpc.health.v1.Health/Checkin RPC server.Results locally: