-
Notifications
You must be signed in to change notification settings - Fork 2
feat: add socket auditor for forwarding logs to coder agent #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
fac8048 to
a2ea4f9
Compare
Add SocketAuditor that sends audit logs to the Coder workspace agent via a Unix socket. This enables boundary audit events to be forwarded to coderd for centralized logging. Implementation notes: - Batching: 10 logs or 5 seconds, whichever comes first - Wire format: tag & length prefixed protobuf. proto imported from AgentAPI to simplify boundary -> agent -> coderd forwarding to start. - CLI and config flag to disable sending of audit logs to workspace agent as an escape hatch
a2ea4f9 to
2365931
Compare
dannykopping
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spotted two fairly significant shortcomings which I think need to be addressed before this can land.
I'm not going to block merge because this is not my project, but I highly recommend these be addressed before proceeding.
| defaultBatchSize = 10 | ||
| defaultBatchTimerDuration = 5 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic numbers should be documented please.
| defaultBatchTimerDuration = 5 * time.Second | ||
| // DefaultAuditSocketPath is the well-known path for the boundary audit socket. | ||
| // The expectation is the Coder agent listens on this socket to receive audit logs. | ||
| DefaultAuditSocketPath = "/tmp/boundary-audit.sock" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/tmp is not guaranteed to exist.
| return &flushErr{err: err, permanent: true} | ||
| } | ||
|
|
||
| if len(data) > 1<<28 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use constants instead of magic numbers, and document how this relates to the wire protocol.
| } | ||
|
|
||
| if err := flush(conn, batch); err != nil { | ||
| s.logger.Warn("failed to flush audit logs", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
| s.logger.Warn("failed to flush audit logs", "error", err) | |
| s.logger.Warn("failed to flush audit logs", slog.Error(err)) |
| s.logger.Warn("failed to flush audit logs", "error", err) | ||
| if err.permanent { | ||
| // Data error: discard batch to avoid infinite retries. | ||
| clearBatch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be logged.
| if len(batch) >= s.batchSize { | ||
| doFlush() | ||
| if len(batch) >= s.batchSize { | ||
| s.logger.Warn("audit log dropped, batch full") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should report something up to the agent in this case.
If operators never look at the boundary logs, they'll never know that logs have been silently dropped.
The agent could receive a payload which informs it of how many were dropped and export a metric.
| } | ||
| connect() | ||
| if conn == nil { | ||
| // No connection: logs will be retried on next flush. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be logged.
| doFlush() | ||
| closeConn() | ||
| return | ||
| case <-t.C: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: t := time.NewTimer(0) will tick immediately, so if the batch is empty this will be a noop.
| connect() | ||
| if conn == nil { | ||
| // No connection: logs will be retried on next flush. | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Stop was called upfront, so when you return here you won't get any further ticks until the buffer is full.
This could lead to buffered logs being stuck until the buffer fills up, which may be a very long time.
|
|
||
| // startTestServer starts a Unix socket server that reads length-prefixed protobuf messages, | ||
| // and reports all received requests to the given channel. | ||
| func startTestServer(t *testing.T, socketPath string, received chan<- *agentproto.ReportBoundaryLogsRequest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will undoubtably lead to inconsistencies.
I'd suggest extracting the message framing code into its own package which can be reused; that way you centralize the core logic.
func WriteFrame(w io.Writer, tag byte, data []byte) error
func ReadFrame(r io.Reader, maxSize int) (tag byte, data []byte, err error)Then in this test you can simply set up a net.Pipe instead of leaking implementation detail from upstream.
And likewise in the related coder PR you don't have to leak any details from this library (agent/boundary_logs_test.go -> sendBoundaryLogsRequest).
Add SocketAuditor that sends audit logs to the Coder workspace agent via a Unix socket. This enables boundary audit events to be forwarded to coderd for centralized logging.
Features:
RFC: https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba?pvs=23
Corresponding PR in coder/coder coder/coder#21345
coder/coder#21280