Skip to content

Conversation

@zedkipp
Copy link
Contributor

@zedkipp zedkipp commented Dec 19, 2025

Add SocketAuditor that sends audit logs to the Coder workspace agent via a Unix socket. This enables boundary audit events to be forwarded to coderd for centralized logging.

Features:

  • Batching: 10 logs or 5 seconds, whichever comes first
  • Wire format: length-prefixed protobuf (proto imported from AgentAPI) to make boundary -> agent -> coderd simple to start

RFC: https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba?pvs=23
Corresponding PR in coder/coder coder/coder#21345
coder/coder#21280

@zedkipp zedkipp force-pushed the zedkipp/socket-auditor branch 3 times, most recently from fac8048 to a2ea4f9 Compare December 19, 2025 21:37
@zedkipp zedkipp marked this pull request as ready for review December 23, 2025 21:39
Add SocketAuditor that sends audit logs to the Coder workspace agent
via a Unix socket. This enables boundary audit events to be forwarded
to coderd for centralized logging.

Implementation notes:
- Batching: 10 logs or 5 seconds, whichever comes first
- Wire format: tag & length prefixed protobuf. proto imported from AgentAPI to
  simplify boundary -> agent -> coderd forwarding to start.
- CLI and config flag to disable sending of audit logs to workspace agent
  as an escape hatch
@zedkipp zedkipp force-pushed the zedkipp/socket-auditor branch from a2ea4f9 to 2365931 Compare December 24, 2025 00:13
Copy link

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotted two fairly significant shortcomings which I think need to be addressed before this can land.

I'm not going to block merge because this is not my project, but I highly recommend these be addressed before proceeding.

Comment on lines +18 to +19
defaultBatchSize = 10
defaultBatchTimerDuration = 5 * time.Second

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic numbers should be documented please.

defaultBatchTimerDuration = 5 * time.Second
// DefaultAuditSocketPath is the well-known path for the boundary audit socket.
// The expectation is the Coder agent listens on this socket to receive audit logs.
DefaultAuditSocketPath = "/tmp/boundary-audit.sock"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/tmp is not guaranteed to exist.

return &flushErr{err: err, permanent: true}
}

if len(data) > 1<<28 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use constants instead of magic numbers, and document how this relates to the wire protocol.

}

if err := flush(conn, batch); err != nil {
s.logger.Warn("failed to flush audit logs", "error", err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
s.logger.Warn("failed to flush audit logs", "error", err)
s.logger.Warn("failed to flush audit logs", slog.Error(err))

s.logger.Warn("failed to flush audit logs", "error", err)
if err.permanent {
// Data error: discard batch to avoid infinite retries.
clearBatch()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be logged.

if len(batch) >= s.batchSize {
doFlush()
if len(batch) >= s.batchSize {
s.logger.Warn("audit log dropped, batch full")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should report something up to the agent in this case.
If operators never look at the boundary logs, they'll never know that logs have been silently dropped.
The agent could receive a payload which informs it of how many were dropped and export a metric.

}
connect()
if conn == nil {
// No connection: logs will be retried on next flush.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be logged.

doFlush()
closeConn()
return
case <-t.C:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: t := time.NewTimer(0) will tick immediately, so if the batch is empty this will be a noop.

connect()
if conn == nil {
// No connection: logs will be retried on next flush.
return

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ t.Stop was called upfront, so when you return here you won't get any further ticks until the buffer is full.

This could lead to buffered logs being stuck until the buffer fills up, which may be a very long time.


// startTestServer starts a Unix socket server that reads length-prefixed protobuf messages,
// and reports all received requests to the given channel.
func startTestServer(t *testing.T, socketPath string, received chan<- *agentproto.ReportBoundaryLogsRequest) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‼️ This is a red flag for me; you're reimplementing the upstream server for your tests.
This will undoubtably lead to inconsistencies.

I'd suggest extracting the message framing code into its own package which can be reused; that way you centralize the core logic.

  func WriteFrame(w io.Writer, tag byte, data []byte) error
  func ReadFrame(r io.Reader, maxSize int) (tag byte, data []byte, err error)

Then in this test you can simply set up a net.Pipe instead of leaking implementation detail from upstream.
And likewise in the related coder PR you don't have to leak any details from this library (agent/boundary_logs_test.go -> sendBoundaryLogsRequest).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants