On 10/18/2023 9:53 AM, Philip Yang wrote:
Caution: This message originated from an External Source. Use proper
caution when opening attachments, clicking links, or responding.
The 255 granularity is from recent Thunk change to increase CWSR area
granularity.
I think we also need to do same restriction at correspondent parts in
Thunk when set range granularity. Setting 255 granularity is far big as
granularity here is log value of page number.
Regards
Xiaogang
Thanks for catching this with kernel debug option
CC_HAS_UBSAN_ARRAY_BOUNDS enabled. Because 1<<prange->granularity is
used in many places, I think the proper fix should be in function
svm_range_apply_attrs
case KFD_IOCTL_SVM_ATTR_GRANULARITY:
- prange->granlarity = attrs[i].value;
+ prange->granlarity = attrs[i].value & 0x3F;
BTW, function svm_range_split_by_granularity() is not used anymore,
forgot the remove it, maybe you are testing on older source code?
Regards,
Philip
On 2023-10-18 09:36, Zhang, Yifan wrote:
[AMD Official Use Only - General]
Hi Jesse,
This patch is only a WA for the error log. How is this issue reproduced ? 255 looks like an invalid value for a prange->granularity, it is better to root cause who set it in the first place.
BRs,
Yifan
-----Original Message-----
From: Jesse Zhang<jesse.zhang@xxxxxxx>
Sent: Wednesday, October 18, 2023 3:50 PM
To:amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander<Alexander.Deucher@xxxxxxx>; Kuehling, Felix<Felix.Kuehling@xxxxxxx>; Yang, Philip<Philip.Yang@xxxxxxx>; Zhang, Yifan<Yifan1.Zhang@xxxxxxx>; Zhang, Jesse(Jie)<Jesse.Zhang@xxxxxxx>; Zhang, Jesse(Jie)<Jesse.Zhang@xxxxxxx>
Subject: [PATCH] drm/amdkfd: Fix shift out-of-bounds issue
[ 567.613292] shift exponent 255 is too large for 64-bit type 'long unsigned int'
[ 567.614498] CPU: 5 PID: 238 Comm: kworker/5:1 Tainted: G OE 6.2.0-34-generic #34~22.04.1-Ubuntu
[ 567.614502] Hardware name: AMD Splinter/Splinter-RPL, BIOS WS43927N_871 09/25/2023 [ 567.614504] Workqueue: events send_exception_work_handler [amdgpu] [ 567.614748] Call Trace:
[ 567.614750] <TASK>
[ 567.614753] dump_stack_lvl+0x48/0x70 [ 567.614761] dump_stack+0x10/0x20 [ 567.614763] __ubsan_handle_shift_out_of_bounds+0x156/0x310
[ 567.614769] ? srso_alias_return_thunk+0x5/0x7f [ 567.614773] ? update_sd_lb_stats.constprop.0+0xf2/0x3c0
[ 567.614780] svm_range_split_by_granularity.cold+0x2b/0x34 [amdgpu] [ 567.615047] ? srso_alias_return_thunk+0x5/0x7f [ 567.615052] svm_migrate_to_ram+0x185/0x4d0 [amdgpu] [ 567.615286] do_swap_page+0x7b6/0xa30 [ 567.615291] ? srso_alias_return_thunk+0x5/0x7f [ 567.615294] ? __free_pages+0x119/0x130 [ 567.615299] handle_pte_fault+0x227/0x280 [ 567.615303] __handle_mm_fault+0x3c0/0x720 [ 567.615311] handle_mm_fault+0x119/0x330 [ 567.615314] ? lock_mm_and_find_vma+0x44/0x250 [ 567.615318] do_user_addr_fault+0x1a9/0x640 [ 567.615323] exc_page_fault+0x81/0x1b0 [ 567.615328] asm_exc_page_fault+0x27/0x30 [ 567.615332] RIP: 0010:__get_user_8+0x1c/0x30
Signed-off-by: Jesse Zhang<Jesse.Zhang@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 7b81233bc9ae..f5e0bccc6d71 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1169,7 +1169,7 @@ svm_range_split_by_granularity(struct kfd_process *p, struct mm_struct *mm,
* PTE will be used for whole range, this reduces the number of PTE
* updated and the L1 TLB space used for translation.
*/
- size = 1UL << prange->granularity;
+ size = 1UL << (prange->granularity & 0x3f);
start = ALIGN_DOWN(addr, size);
last = ALIGN(addr + 1, size) - 1;
--
2.25.1