DAOS-18487 object: control EC rebuild resource consumption #17441
gnailzenh merged 36 commits into release/2.6
A degraded EC read allocates and registers an extra buffer to recover data, which may cause ENOMEM in some cases. This workaround does not prevent dynamic buffer allocation and registration, but it does provide relatively precise control over the resources consumed by degraded EC reads. Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
|
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/2/testReport/ |
Signed-off-by: Liang Zhen <[email protected]>
For data migration, after being woken up, the ULT should try to wake up another ULT if resources are still available. Signed-off-by: Liang Zhen <[email protected]>
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/4/testReport/ |
Signed-off-by: Liang Zhen <[email protected]>
9664eb4
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/5/testReport/ |
|
Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17441/5/execution/node/1541/log |
Signed-off-by: Liang Zhen <[email protected]>
- Add a resource bucket so that overall resource consumption does not grow on systems configured with more targets
- Track demanded resources and a waitq for blocked ULTs, and wake up as many waiters as the released resources allow
- Code cleanup
Signed-off-by: Liang Zhen <[email protected]>
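The bucket and wait-queue behaviour described in this commit message can be illustrated with a minimal, single-threaded sketch. This is not the DAOS implementation: `res_bucket`, `bucket_acquire`, `bucket_release`, `MAX_WAITERS`, and the FIFO policy are all hypothetical names and choices made for illustration, and real ULT blocking and wakeup are replaced by simply recording which waiter IDs would be woken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 16 /* hypothetical fixed queue size for the sketch */

/* A blocked request: who is waiting and how much it demanded. */
struct waiter {
    int  id;     /* stands in for a blocked ULT */
    long demand; /* units the ULT is waiting for */
};

/* One shared bucket, so the total does not scale with target count. */
struct res_bucket {
    long          limit;              /* total units available */
    long          used;               /* units currently held */
    struct waiter waitq[MAX_WAITERS]; /* FIFO ring of blocked requests */
    size_t        head;               /* index of the oldest waiter */
    size_t        count;              /* number of queued waiters */
};

void bucket_init(struct res_bucket *b, long limit)
{
    b->limit = limit;
    b->used  = 0;
    b->head  = 0;
    b->count = 0;
}

/* Take `units` if possible; otherwise queue the caller. Returns true if granted. */
bool bucket_acquire(struct res_bucket *b, int id, long units)
{
    size_t tail;

    /* Assumption: a single demand never exceeds the bucket limit. */
    assert(units > 0 && units <= b->limit);

    /* Grant immediately only if nobody is queued ahead (keeps FIFO order). */
    if (b->count == 0 && b->used + units <= b->limit) {
        b->used += units;
        return true;
    }

    assert(b->count < MAX_WAITERS);
    tail = (b->head + b->count) % MAX_WAITERS;
    b->waitq[tail].id     = id;
    b->waitq[tail].demand = units;
    b->count++;
    return false;
}

/*
 * Return `units` to the bucket, then grant queued demands in FIFO order
 * for as long as they fit. IDs of woken waiters are written to `woken`;
 * the return value is how many were woken.
 */
size_t bucket_release(struct res_bucket *b, long units,
                      int *woken, size_t max_woken)
{
    size_t n = 0;

    assert(units >= 0 && units <= b->used);
    b->used -= units;

    while (b->count > 0 && n < max_woken) {
        struct waiter *w = &b->waitq[b->head];

        if (b->used + w->demand > b->limit)
            break; /* head waiter does not fit yet; stop waking */

        b->used += w->demand; /* charge the demand before waking */
        woken[n++] = w->id;
        b->head = (b->head + 1) % MAX_WAITERS;
        b->count--;
    }
    return n;
}
```

Charging each waiter's demand at wakeup time means a single release can wake several waiters without over-committing the bucket, which is the "as many waiters as the released resources allow" behaviour. Stopping at the first waiter that does not fit is one possible policy; it avoids starving large requests at the cost of leaving some capacity idle.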
Increase the default resource limit Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Fix a reference leak in migrate_fini_one_ult() Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
- Hulk data handling is no longer required; it is replaced by the starveling mechanism
- Remove the "yield" and simplify the code
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
If a rebuild hang is detected, dump resource bucket information and the waiter queue head Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/22/testReport/ |
Signed-off-by: Liang Zhen <[email protected]>
- Remove private resource
- Add hulk data back, but as a separate resource type
- Other cleanups
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Liang Zhen <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Wang Shilong <[email protected]>
|
Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/30/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17441/31/testReport/ |
|
This PR was merged with lint failures, and those failures will now appear on every 2.6 PR and on landing runs for 2.6 |
|
I pushed #17763 to fix it.