Add DeviceClass lifecycle management to DRA reconciler#1285
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: TomerNewman The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
|
876d626 to
4d8e7c2
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1285 +/- ##
==========================================
- Coverage 79.09% 73.69% -5.40%
==========================================
Files 51 67 +16
Lines 5109 4977 -132
==========================================
- Hits 4041 3668 -373
- Misses 882 1140 +258
+ Partials 186 169 -17 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
✅ Deploy Preview for kubernetes-sigs-kmm ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
✅ Deploy Preview for kubernetes-sigs-kmm ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
✅ Deploy Preview for kubernetes-sigs-kmm ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
/assign @yevgeny-shnaidman @ybettan |
| if mod.Spec.DRA == nil { | ||
| if len(existingDRADS) > 0 { | ||
| if err = r.reconHelperAPI.deleteDRADaemonSets(ctx, existingDRADS); err != nil { | ||
| if len(existingDRADS) > 0 || len(existingDCs) > 0 { |
There was a problem hiding this comment.
no need to check if they exist: deleteDRAResource support receiving zero-length slices
There was a problem hiding this comment.
Done, removed the len() guard.
| ) error { | ||
| logger := log.FromContext(ctx) | ||
|
|
||
| if !exists { |
There was a problem hiding this comment.
Why not use CreateOrPatch function here? That way you won't need to pass the exist and existingDC parameters, and is implified the code
There was a problem hiding this comment.
Done, switched to controllerutil.CreateOrPatch.
|
|
||
| // Create missing or patch existing DeviceClasses to match the desired spec. | ||
| for _, desired := range mod.Spec.DRA.DeviceClasses { | ||
| existingDC, exists := existingByName[desired.Name] |
There was a problem hiding this comment.
you can remove the desired from existingDC, and then in 304 you can just go over DC that are left in the existingDC
There was a problem hiding this comment.
Done, now deleting from existingByName during the loop and iterating over what remains.
4d8e7c2 to
0230174
Compare
|
/retest |
|
/lgtm |
|
/retest |
|
/hold |
| if err := drh.client.Delete(ctx, &ds); err != nil && !apierrors.IsNotFound(err) { | ||
| errs = append(errs, fmt.Errorf("failed to delete DRA DaemonSet %s/%s: %v", ds.Namespace, ds.Name, err)) | ||
| } | ||
| } |
There was a problem hiding this comment.
Can we use drh.client.DeleteAllOf(...) instead of looping?
| for _, dc := range deviceClasses { | ||
| if err := drh.client.Delete(ctx, &dc); err != nil && !apierrors.IsNotFound(err) { | ||
| errs = append(errs, fmt.Errorf("failed to delete DeviceClass %s: %v", dc.Name, err)) |
There was a problem hiding this comment.
Can we use drh.client.DeleteAllOf(...) instead of looping?
| if deleteErr := drh.client.Delete(ctx, &dc); deleteErr != nil && !apierrors.IsNotFound(deleteErr) { | ||
| errs = append(errs, fmt.Errorf("failed to delete DeviceClass %s: %v", name, deleteErr)) |
There was a problem hiding this comment.
Can we construct a list of DC to be deleted and delete all of them in a single API call?
There was a problem hiding this comment.
Unlike the full cleanup path (where DeleteAllOf works since all targets share the same labels), here we only want to delete the stale subset — not all module-owned DeviceClasses. DeleteAllOf with module labels would also remove the ones we just created/patched. Since there's no Kubernetes API to bulk-delete a specific set of named resources, individual Delete calls on the remaining map entries is the correct approach for selective deletion.
Extend the DRAReconciler to manage cluster-scoped DeviceClass resources declared in spec.dra.deviceClasses. DeviceClasses use label-based ownership (ModuleNameLabel + ModuleNamespaceLabel) since ownerReferences cannot reference namespaced resources from cluster-scoped objects. The reconciler performs declarative convergence on each cycle: - Creates missing DeviceClasses with ownership labels - Patches existing DeviceClasses to correct spec drift - Deletes stale DeviceClasses no longer in the desired list - Cleans up all DeviceClasses on Module deletion or DRA removal Errors are aggregated via errors.Join to attempt all operations before requeueing. NotFound errors on delete are tolerated to handle races.
0230174 to
675df52
Compare
|
/lgtm |
Add DeviceClass lifecycle management to DRA reconciler
Summary
Extends the DRAReconciler to manage cluster-scoped Kubernetes DeviceClass resources declared in
spec.dra.deviceClasses. Since DeviceClasses are cluster-scoped, ownership is tracked via labels (ModuleNameLabel+ModuleNamespaceLabel) rather than ownerReferences. The reconciler performs declarative convergence — creating missing, patching drifted, and deleting stale DeviceClasses each cycle.Changes
internal/constants/constants.goModuleNamespaceLabelconstant for cluster-scoped ownership trackinginternal/controllers/dra_reconciler.gogetModuleDeviceClasses— lists DeviceClasses by ownership labelshandleDeviceClasses— orchestrates create/patch/delete convergencecreateOrPatchDeviceClass— handles single DeviceClass create or patchdeleteDRAResources— unified deletion of DaemonSets and DeviceClasses witherrors.Joinaggregation andIsNotFoundtoleranceReconcileloop (deletion, nil-DRA, and happy paths)internal/controllers/dra_reconciler_test.goReconciletests for new DeviceClass expectationsDRAReconciler_getModuleDeviceClassestest blockDRAReconciler_handleDeviceClassestest block (create, patch, delete, errors, no-op)DRAReconciler_deleteDRAResourcestest block (empty, success, partial failure, aggregated errors)Testing
Reconcileat 96.2%make generate,make fmt,make vet,make unit-testall passAcceptance Criteria
ModuleNameLabel+ModuleNamespaceLabelspec.selectorsandspec.configspec.draremoval triggers cleanup of all matching DeviceClasses