feat(v2:telemetry): more metrics & grafana dashboards#1442
Conversation
93d1755 to
ceafbbe
Compare
3b8edf7 to
a88dab6
Compare
a88dab6 to
3942c1b
Compare
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.
... and 2 files with indirect coverage changes 🚀 New features to boost your workflow:
|
3942c1b to
532c76e
Compare
ceafbbe to
48c4947
Compare
dbe1c5a to
a6c99dc
Compare
48c4947 to
32475ad
Compare
a6c99dc to
dce1e5d
Compare
d0d11ab to
8407233
Compare
96c8137 to
24b61de
Compare
563c1e6 to
08f7192
Compare
| /// Equivalent to `Logger`, but with a runtime `scope`. | ||
| /// This is intended as the parameter of "leaf" or "branch" functions that want to inherit the | ||
| /// caller's scope; it is for this reason that it can only convert to and from an actual scoped | ||
| /// logger, and not another `AnyLogger`. | ||
| pub const AnyLogger = struct { |
There was a problem hiding this comment.
Is this abstraction needed yet? I don't see any usages of AnyLogger outside this file.
My preference would be for a log message's scope to reflect the location where it is emitted. That matches how loggers commonly behave across languages, and it gives readers a stable, predictable signal about where the log originated.
I agree there can be value in adding tracing context, especially for general-purpose functions called from many places. I just would not overload scope to serve both purposes, since that makes the field ambiguous for people reading the logs. I would rather keep scope as the source location and add a separate contextual field for tracing or caller-specific information. That preserves a conventional and precise meaning for scope while still supporting the additional context this is aiming for.
Given that, I think it would be simpler to remove this abstraction for now and keep scope static. Supporting both static and dynamic scope adds another layer of complexity that callers and readers need to reason about. We can add dynamic context fields once there is a clear need for them.
There was a problem hiding this comment.
I originally did have a change that used this, but it was swept away after a rebase; the main reason I introduced it was because there was a general-purpose function in gossip which was called from many contexts, and in fact its behaviour depended explicitly on who the caller was (it had an enum that directly said who the caller was).
But yeah, given it isn't needed, I have no issue removing it, long live YAGNI.
There was a problem hiding this comment.
Dropped the commit
301146b to
004ff44
Compare
|
Not related to the changes in this PR directly, but while testing the new grafana setup locally I found that prometheus was rejecting/not decoding the histogram metrics ( I think this has to do with a formatting issue here: sig/v2/lib/telemetry/prometheus.zig Line 85 in 004ff44 Changing that line to include the outer double-quotes in the payload seems to fix this: try w.print("le=\"{f}\"", .{numberFmt(bucket.upper_bound)});versions i'm using in case it has something to do with it: Update: it seems like my prometheus setup can't scrape any metrics unless I make the above code change Another update: I did sidestep the docker setup and just ran prometheus and grafana services directly for testing things on my end. Maybe this isn't an issue in proper setups, not something that is blocking for me 🫡 Another update after the other update: it seems like prometheus scrape fails without the double quote change. From the spec it seems like |
prestonsn
left a comment
There was a problem hiding this comment.
LGTM overall, just a few minor things I noticed
| /// Queries the system for the current moment of time as an Instant. | ||
| /// This is not guaranteed to be monotonic or steadily increasing, but for | ||
| /// most implementations it is. | ||
| /// Returns `error.Unsupported` when a suitable clock is not detected. | ||
| pub fn now() Instant { | ||
| const clock_id: std.posix.CLOCK = .BOOTTIME; | ||
| const ts = std.posix.clock_gettime(clock_id) catch |err| { | ||
| std.debug.panic("clock_gettime unsupported: {}", .{err}); | ||
| }; | ||
| return .{ .timestamp = ts }; | ||
| } |
There was a problem hiding this comment.
I like the Instant and Timer abstractions a lot. I've got a couple questions tho:
- Is there a reason why we don't use
.MONOTONIChere? last i checked, the std lib uses it too (but perhaps that's part of the reason why its getting overhauled?) - Could we integrate/use
lib.clockhere in some way? It has the warmup path already setup. Wondering if this separation was intended
There was a problem hiding this comment.
On point 1, not sure iiuc. Are you saying stdlib uses MONOTONIC? Because that may be true for something else, but the Instant abstraction uses BOOTTIME on linux. If you're suggesting replacing BOOTTIME to stray from what stdlib was using, then sure I suppose, I think it probably doesn't matter a huge amount in our case.
On point 2, yeah, I implemented this before lib.clock got implemented. I'll give this another pass and probably move this into there.
There was a problem hiding this comment.
on 1, I see your point actually. I was mostly curious as why we chose BOOTTIME here. I checked the zig std lib and was totally wrong on stating it uses MONOTONIC, I thought it did for some reason. I'm cool with sticking with BOOTTIME for this
004ff44 to
9f965e6
Compare
Gossip's "onDiscovered" & shred receiver's packet logs are pretty spammy.
Mark some paths as unreachable, and be a bit more error-resistant paths
Usually the service name
Mostly a stripped down copy of v1's
9f965e6 to
4cf6876
Compare
No description provided.