The 255 granularity is from recent Thunk change to increase CWSR area granularity.
Thanks for catching this with kernel debug option CC_HAS_UBSAN_ARRAY_BOUNDS enabled. Because 1<<prange->granularity is used in many places, I think the proper fix should be in function svm_range_apply_attrs
case KFD_IOCTL_SVM_ATTR_GRANULARITY:
- prange->granlarity = attrs[i].value;
+ prange->granlarity = attrs[i].value & 0x3F;
BTW, function svm_range_split_by_granularity() is not used anymore, forgot the remove it, maybe you are testing on older source code?
Regards,
Philip
On 2023-10-18 09:36, Zhang, Yifan
wrote:
[AMD Official Use Only - General] Hi Jesse, This patch is only a WA for the error log. How is this issue reproduced ? 255 looks like an invalid value for a prange->granularity, it is better to root cause who set it in the first place. BRs, Yifan -----Original Message----- From: Jesse Zhang <jesse.zhang@xxxxxxx> Sent: Wednesday, October 18, 2023 3:50 PM To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Yang, Philip <Philip.Yang@xxxxxxx>; Zhang, Yifan <Yifan1.Zhang@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx> Subject: [PATCH] drm/amdkfd: Fix shift out-of-bounds issue [ 567.613292] shift exponent 255 is too large for 64-bit type 'long unsigned int' [ 567.614498] CPU: 5 PID: 238 Comm: kworker/5:1 Tainted: G OE 6.2.0-34-generic #34~22.04.1-Ubuntu [ 567.614502] Hardware name: AMD Splinter/Splinter-RPL, BIOS WS43927N_871 09/25/2023 [ 567.614504] Workqueue: events send_exception_work_handler [amdgpu] [ 567.614748] Call Trace: [ 567.614750] <TASK> [ 567.614753] dump_stack_lvl+0x48/0x70 [ 567.614761] dump_stack+0x10/0x20 [ 567.614763] __ubsan_handle_shift_out_of_bounds+0x156/0x310 [ 567.614769] ? srso_alias_return_thunk+0x5/0x7f [ 567.614773] ? update_sd_lb_stats.constprop.0+0xf2/0x3c0 [ 567.614780] svm_range_split_by_granularity.cold+0x2b/0x34 [amdgpu] [ 567.615047] ? srso_alias_return_thunk+0x5/0x7f [ 567.615052] svm_migrate_to_ram+0x185/0x4d0 [amdgpu] [ 567.615286] do_swap_page+0x7b6/0xa30 [ 567.615291] ? srso_alias_return_thunk+0x5/0x7f [ 567.615294] ? __free_pages+0x119/0x130 [ 567.615299] handle_pte_fault+0x227/0x280 [ 567.615303] __handle_mm_fault+0x3c0/0x720 [ 567.615311] handle_mm_fault+0x119/0x330 [ 567.615314] ? lock_mm_and_find_vma+0x44/0x250 [ 567.615318] do_user_addr_fault+0x1a9/0x640 [ 567.615323] exc_page_fault+0x81/0x1b0 [ 567.615328] asm_exc_page_fault+0x27/0x30 [ 567.615332] RIP: 0010:__get_user_8+0x1c/0x30 Signed-off-by: Jesse Zhang <Jesse.Zhang@xxxxxxx> --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 7b81233bc9ae..f5e0bccc6d71 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1169,7 +1169,7 @@ svm_range_split_by_granularity(struct kfd_process *p, struct mm_struct *mm, * PTE will be used for whole range, this reduces the number of PTE * updated and the L1 TLB space used for translation. */ - size = 1UL << prange->granularity; + size = 1UL << (prange->granularity & 0x3f); start = ALIGN_DOWN(addr, size); last = ALIGN(addr + 1, size) - 1; -- 2.25.1