Prototype audit log improvements + coverage test #9467

david-crespo · 2025-12-03T21:36:45Z

This is a draft because I want to get reactions to the approach. Would address #8820, #8811, and #8819. The idea is:

An ergonomic helper audit_and_time that
a. Wraps up the two annoying audit log calls — init and complete — into a single function that takes the handler logic as a callback
b. Combines that with the latency timing function, because it's ugly to do both
A test that uses the VERIFY_ENDPOINTS list to make sure every non-get endpoint gets an audit log entry when you call it

The only added functionality here is the use of a trait MaybeHasResourceId (could rename to AuditResponse or something) to extract the ID of created resources so we can log them. Note that this PR doesn't do any of the DB work yet to actually store that ID — I want to make sure people like the callsite ergonomics first.

Concerns and future plans

Here I am specifically pulling out the ID of the created resource, but in Audit log: log arbitrary data about request and response bodies #8821 we discuss wanting to log the entire response body, and that would include the ID and be strictly more general. So we might want to skip straight to that if we expect to do it eventually. Added some notes on handling giant responses here.
This doesn't address Audit log: log ID of token or session used for authn #8813 but that issue doesn't have anything to do with the helper — we'll have to store the credential ID in the opctx and pull it off of there in the audit log entry init call

david-crespo · 2025-12-03T21:37:34Z

nexus/tests/output/uncovered-audit-log-endpoints.txt

@@ -0,0 +1,123 @@
+Mutating endpoints without audit logging:


This file would be empty or nearly empty in the complete version of this PR.

nexus/src/external_api/http_entrypoints.rs

-            .context
-            .external_latencies
-            .instrument_dropshot_handler(&rqctx, handler)
+            })


test for verifying audit log coverage

david-crespo · 2025-12-03T22:01:08Z

nexus/src/external_api/http_entrypoints.rs

+                nexus.scim_idp_create_token(&opctx, &silo_lookup).await?;
+            Ok(HttpResponseCreated(token))
+        })
+        .await


The ergonomics comparison here is.... favorable.

inickles

Some of what I have below was discussed offline, but I'll mention here for posterity.

The callsite ergonomics look better, though I'm wondering if what's here will be the full story considering I think we'll want more than a top-level resource_id field for the audit event (which I think it's what's proposed here?). I think there could be a number of things we'd want to include an audit event, and that each endpoint will need to define a what it wants to include in the audit log, which could be a subset of the elements returned in the response body itself. And so I think that means audit_and_time would need more than just a handler, or perhaps the handler types must implement functions that define what that audit log response body would be?

I don't think entire request or response bodies should be automatically added to the audit log, as some data isn't relevant or helpful to an audit log, such as block data, which could blow up the size of the log.

For example, a single API call to create an instance can be used to also create new disk and network interface, and ideally all of those new created resource IDs are in the audit log.

I could see similar for request parameters as well, mostly for the cases where interesting request params are in POST body data and not available in the URL. Though, for the sake of parsing, I think I would prefer to for some of that data to also be available in the audit event body that I could parse with just JSON and not also have to parse a URL.

I think it's common for API logs to do this kind of thing, include a response body element that has response data specific to the endpoint.

inickles · 2025-12-09T23:34:53Z

nexus/src/context.rs

+// so names don't actually identify things uniquely the way IDs do. So we may
+// end up needing to record the ID for delete or update operations as well.


That's a good point.

inickles · 2025-12-09T23:37:53Z

nexus/src/context.rs

+            // TODO: pass resource_id to audit_log_entry_complete once
+            // the schema supports it
+            let _resource_id =
+                result.as_ref().ok().and_then(|r| r.resource_id());


The idea being that resource_id is a top-level field of an audit event?

That was the idea here. Another way would be to have a general concept of response summary, like we talked about, probably as a JSON column so it could be free-form, and that would have the ID in it. The problem there is that the summary column would have no guaranteed schema because different things might get different summaries, and if the ID is going to be especially load-bearing, it would probably make sense to give it its own column. Especially if we're recording it for deletes and updates too, like my code comment mentions. In that case nearly every audit log entry would have a resource ID associated with it.

Though something that just occurred to me is that it is not necessarily always obvious what the ID is of. If you're doing a project update, then it's obvious that it would be the ID of the project. But if you update the firewall rules on a VPC, the ID would be that of the VPC because there is no ID associated with the firewall rules specifically.

I wonder if I should include a resource type string indicating what kind of ID it is. That is kind of painful to think about, but we do probably have a string like that around for every resource (see ResourceType below). I'm going to have to do a kind of audit of all the endpoints to see if problematic cases are the rule or the exception.

omicron/common/src/api/external/mod.rs

Lines 933 to 1013 in 1c69391

#[display(style = "kebab-case")]

pub enum ResourceType {

AddressLot,

AddressLotBlock,

AffinityGroup,

AffinityGroupMember,

Alert,

AlertReceiver,

AllowList,

AntiAffinityGroup,

AntiAffinityGroupMember,

AuditLogEntry,

BackgroundTask,

BgpAnnounceSet,

BgpConfig,

Blueprint,

Certificate,

ConsoleSession,

Dataset,

DeviceAccessToken,

DeviceAuthRequest,

Disk,

Fleet,

FloatingIp,

IdentityProvider,

Image,

Instance,

InstanceNetworkInterface,

InternetGateway,

InternetGatewayIpAddress,

InternetGatewayIpPool,

IpPool,

IpPoolResource,

LldpLinkConfig,

LoopbackAddress,

MetricProducer,

MulticastGroup,

MulticastGroupMember,

NatEntry,

Oximeter,

PhysicalDisk,

Probe,

ProbeNetworkInterface,

Project,

ProjectImage,

Rack,

RoleBuiltin,

RouterRoute,

SagaDbg,

SamlIdentityProvider,

ScimClientBearerToken,

Service,

ServiceNetworkInterface,

Silo,

SiloAuthSettings,

SiloGroup,

SiloImage,

SiloQuotas,

SiloUser,

Sled,

SledInstance,

SledLedger,

Snapshot,

SshKey,

SupportBundle,

Switch,

SwitchPort,

SwitchPortSettings,

TufArtifact,

TufRepo,

TufTrustRoot,

UserBuiltin,

Vmm,

Volume,

Vpc,

VpcFirewallRule,

VpcRouter,

VpcSubnet,

WebhookSecret,

Zpool,

}

That was the idea here. Another way would be to have a general concept of response summary, like we talked about, probably as a JSON column so it could be free-form, and that would have the ID in it.

Would it have to be free form (which would mean to use serde_json::Value?)? I'd think we could do something like:

pub enum ResponseData { CreateInstance { instance_id: Uuid, disk_id: Option<Uuid>, network_interface_id: Option<Uuid> } }

At which point we might be able to do adjacent tagging on the operation_id:

#[serde(tag = "operation_id", content: "response_data")]`

Tagging specifics aside, I think still don't see how the single top-level resource id and type (considering we add that) will address the example of the instance create call, where multiple resources could be created.

That is a great point, hadn't thought about create requests that create multiple things. I was only thinking about the case where the endpoint modifies something without its own ID.

A summary like the one you suggest would be required to get a complete picture. Instance disks are even more complicated than your example suggests, because you can give a list of disks, some of which already exist and some of which get created. Do you list only things that got created, or do you list created and attached disks separately? Interestingly, even logging the full response object would not solve this because disks and NICs are not included in that response struct (docs) or in the instance DB model. So building the desired summary here would require quite a bit of extra plumbing. One way around that would be to treat the list of disks and NICs as part of a request summary instead of a response summary. One problem with that, which I've pointed out elsewhere, is that we get disk names in the request instead of resolved IDs.

I had Claude go through and make a report about every write operation, highlighting the weird ones.

https://gist.github.com/david-crespo/b459ee1e7b68a2fae16d5ebb64869d51

Pretty nifty. An excerpt:

Special Cases Requiring Attention

Operation Challenge

instance_create NICs, disks, and IPs created are not in response. Must capture from saga outputs.

vpc_firewall_rules_update Bulk replace - need IDs of both deleted and created rules.

project_create VPC and all children created are not in response (only project returned).

vpc_create Router, routes, subnet, gateway not in response.

policy_update variants Bulk replace - need IDs of deleted and created role assignments.

networking_bgp_announce_set_update Bulk replace - need IDs of announcements being replaced.

david-crespo commented Dec 3, 2025

View reviewed changes

nexus/src/external_api/http_entrypoints.rs Outdated

.context

.external_latencies

.instrument_dropshot_handler(&rqctx, handler)

})

This comment was marked as outdated.

Sign in to view

prototype audit log improvements + coverage test

79c0228

test for verifying audit log coverage

david-crespo force-pushed the audit-log-redux branch from c07a3ff to 79c0228 Compare December 3, 2025 21:49

david-crespo commented Dec 3, 2025

View reviewed changes

inickles self-requested a review December 5, 2025 18:27

inickles reviewed Dec 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prototype audit log improvements + coverage test #9467

Prototype audit log improvements + coverage test #9467

Uh oh!

david-crespo commented Dec 3, 2025 •

edited

Loading

Uh oh!

david-crespo Dec 3, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

david-crespo Dec 3, 2025

Uh oh!

inickles left a comment

Uh oh!

inickles Dec 9, 2025

Uh oh!

inickles Dec 9, 2025

Uh oh!

david-crespo Dec 10, 2025

Uh oh!

david-crespo Dec 10, 2025 •

edited

Loading

Uh oh!

inickles Dec 15, 2025

Uh oh!

david-crespo Dec 16, 2025 •

edited

Loading

Uh oh!

david-crespo Dec 17, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// so names don't actually identify things uniquely the way IDs do. So we may
		// end up needing to record the ID for delete or update operations as well.

	#[display(style = "kebab-case")]
	pub enum ResourceType {
	AddressLot,
	AddressLotBlock,
	AffinityGroup,
	AffinityGroupMember,
	Alert,
	AlertReceiver,
	AllowList,
	AntiAffinityGroup,
	AntiAffinityGroupMember,
	AuditLogEntry,
	BackgroundTask,
	BgpAnnounceSet,
	BgpConfig,
	Blueprint,
	Certificate,
	ConsoleSession,
	Dataset,
	DeviceAccessToken,
	DeviceAuthRequest,
	Disk,
	Fleet,
	FloatingIp,
	IdentityProvider,
	Image,
	Instance,
	InstanceNetworkInterface,
	InternetGateway,
	InternetGatewayIpAddress,
	InternetGatewayIpPool,
	IpPool,
	IpPoolResource,
	LldpLinkConfig,
	LoopbackAddress,
	MetricProducer,
	MulticastGroup,
	MulticastGroupMember,
	NatEntry,
	Oximeter,
	PhysicalDisk,
	Probe,
	ProbeNetworkInterface,
	Project,
	ProjectImage,
	Rack,
	RoleBuiltin,
	RouterRoute,
	SagaDbg,
	SamlIdentityProvider,
	ScimClientBearerToken,
	Service,
	ServiceNetworkInterface,
	Silo,
	SiloAuthSettings,
	SiloGroup,
	SiloImage,
	SiloQuotas,
	SiloUser,
	Sled,
	SledInstance,
	SledLedger,
	Snapshot,
	SshKey,
	SupportBundle,
	Switch,
	SwitchPort,
	SwitchPortSettings,
	TufArtifact,
	TufRepo,
	TufTrustRoot,
	UserBuiltin,
	Vmm,
	Volume,
	Vpc,
	VpcFirewallRule,
	VpcRouter,
	VpcSubnet,
	WebhookSecret,
	Zpool,
	}

Operation	Challenge
`instance_create`	NICs, disks, and IPs created are not in response. Must capture from saga outputs.
`vpc_firewall_rules_update`	Bulk replace - need IDs of both deleted and created rules.
`project_create`	VPC and all children created are not in response (only project returned).
`vpc_create`	Router, routes, subnet, gateway not in response.
`policy_update` variants	Bulk replace - need IDs of deleted and created role assignments.
`networking_bgp_announce_set_update`	Bulk replace - need IDs of announcements being replaced.

Prototype audit log improvements + coverage test #9467

Are you sure you want to change the base?

Prototype audit log improvements + coverage test #9467

Uh oh!

Conversation

david-crespo commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Concerns and future plans

Uh oh!

david-crespo Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

david-crespo Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

inickles left a comment

Choose a reason for hiding this comment

Uh oh!

inickles Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

inickles Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

david-crespo Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

david-crespo Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

inickles Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

david-crespo Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

david-crespo Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Special Cases Requiring Attention

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

david-crespo commented Dec 3, 2025 •

edited

Loading

david-crespo Dec 10, 2025 •

edited

Loading

david-crespo Dec 16, 2025 •

edited

Loading

david-crespo Dec 17, 2025 •

edited

Loading