No problem on Vega 20

Andrey

On 10/22/19 1:46 PM, Zeng, Oak wrote:
> Sorry, I searched my kconfig and couldn't find a stack size option anymore... Maybe the kernel stack size is no longer configurable...
>
> Can you try your kernel on vega10 or 20 or navi10? We want to know whether this is an MI100-specific issue.
>
> Oak
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: Tuesday, October 22, 2019 1:28 PM
> To: Zeng, Oak <Oak.Zeng@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> I don't know - what Kconfig flag should I look at?
>
> Andrey
>
> On 10/22/19 1:17 PM, Zeng, Oak wrote:
>> Sorry, I meant: is the kernel stack size 16KB in your kconfig?
>>
>> Oak
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>> Sent: Tuesday, October 22, 2019 12:49 PM
>> To: Zeng, Oak <Oak.Zeng@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>
>> On 10/18/19 5:31 PM, Zeng, Oak wrote:
>>
>>> Hi Andrey,
>>>
>>> What is your system configuration? I didn’t see this issue before. Also see attached QA's configuration - you can compare to see any difference.
>> Attached is my lshw
>>
>>> Also I believe for x86-64, the default kernel stack size is 16KB? Is this your Kconfig?
>> What do you mean by "is this my Kconfig"? Is there a particular Kconfig flag you know of that I can look for?
>>
>> Andrey
>>
>>
>>> Regards,
>>> Oak
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Kuehling, Felix
>>> Sent: Friday, October 18, 2019 4:55 PM
>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>>
>>> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>>>> Not that I'm aware of. Is there a special Kconfig flag that
>>>> determines the stack size?
>>> I remember there used to be a Kconfig option to force a 4KB kernel stack. I don't see it in the current kernel any more.
>>>
>>> I don't have time to work on this myself. I'll create a ticket and see if I can find someone to investigate.
>>>
>>> Thanks,
>>> Felix
>>>
>>>
>>>> Andrey
>>>>
>>>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
>>>>> I don't see why this problem would be specific to Arcturus. I don't
>>>>> see any excessive allocations on the stack either. Also the code
>>>>> involved here hasn't changed recently.
>>>>>
>>>>> Are you using some weird kernel config with a smaller stack? Is it
>>>>> specific to a compiler version or some optimization flags? I've
>>>>> sometimes seen function inlining cause excessive stack usage.
>>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
>>>>>> Hi Felix - I see this on boot when working with Arcturus.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> [ 103.602092] kfd kfd: Allocated 3969056 bytes on gart
>>>>>> [ 103.610769] ==================================================================
>>>>>> [ 103.611469] BUG: KASAN: stack-out-of-bounds in kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [ 103.611646] Read of size 4 at addr ffff8883cb19ee38 by task modprobe/1122
>>>>>>
>>>>>> [ 103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O 5.3.0-rc3+ #45
>>>>>> [ 103.611847] Hardware name: System manufacturer System Product Name/Z170-PRO, BIOS 1902 06/27/2016
>>>>>> [ 103.611856] Call Trace:
>>>>>> [ 103.611879] dump_stack+0x71/0xab
>>>>>> [ 103.611907] print_address_description+0x1da/0x3c0
>>>>>> [ 103.612453] ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [ 103.612479] __kasan_report+0x13f/0x1a0
>>>>>> [ 103.613022] ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [ 103.613580] ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [ 103.613604] kasan_report+0xe/0x20
>>>>>> [ 103.614149] kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [ 103.614762] ? kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu]
>>>>>> [ 103.614796] ? __alloc_pages_nodemask+0x2c9/0x560
>>>>>> [ 103.614824] ? __alloc_pages_slowpath+0x1390/0x1390
>>>>>> [ 103.614898] ? kmalloc_order+0x63/0x70
>>>>>> [ 103.615469] kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu]
>>>>>> [ 103.616054] ? kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu]
>>>>>> [ 103.616095] ? up_write+0x4b/0x70
>>>>>> [ 103.616649] kfd_topology_add_device+0x98d/0xb10 [amdgpu]
>>>>>> [ 103.617207] ? kfd_topology_shutdown+0x60/0x60 [amdgpu]
>>>>>> [ 103.617743] ? start_cpsch+0x2ff/0x3a0 [amdgpu]
>>>>>> [ 103.617777] ? mutex_lock_io_nested+0xac0/0xac0
>>>>>> [ 103.617807] ? __mutex_unlock_slowpath+0xda/0x420
>>>>>> [ 103.617848] ? __mutex_unlock_slowpath+0xda/0x420
>>>>>> [ 103.617877] ? wait_for_completion+0x200/0x200
>>>>>> [ 103.618461] ? start_cpsch+0x38b/0x3a0 [amdgpu]
>>>>>> [ 103.619011] ? create_queue_cpsch+0x670/0x670 [amdgpu]
>>>>>> [ 103.619573] ? kfd_iommu_device_init+0x92/0x1e0 [amdgpu]
>>>>>> [ 103.620112] ? kfd_iommu_resume+0x2c/0x2c0 [amdgpu]
>>>>>> [ 103.620655] ? kfd_iommu_check_device+0xf0/0xf0 [amdgpu]
>>>>>> [ 103.621228] kgd2kfd_device_init+0x474/0x870 [amdgpu]
>>>>>> [ 103.621781] amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu]
>>>>>> [ 103.622329] ? amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu]
>>>>>> [ 103.622344] ? kmsg_dump_rewind_nolock+0x59/0x59
>>>>>> [ 103.622895] ? amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu]
>>>>>> [ 103.623424] amdgpu_device_init+0x1bbe/0x2f00 [amdgpu]
>>>>>> [ 103.623819] ? amdgpu_device_has_dc_support+0x30/0x30 [amdgpu]
>>>>>> [ 103.623842] ? __isolate_free_page+0x290/0x290
>>>>>> [ 103.623852] ? fs_reclaim_acquire.part.97+0x5/0x30
>>>>>> [ 103.623891] ? __alloc_pages_nodemask+0x2c9/0x560
>>>>>> [ 103.623912] ? __alloc_pages_slowpath+0x1390/0x1390
>>>>>> [ 103.623945] ? kasan_unpoison_shadow+0x31/0x40
>>>>>> [ 103.623970] ? kmalloc_order+0x63/0x70
>>>>>> [ 103.624337] amdgpu_driver_load_kms+0xd9/0x430 [amdgpu]
>>>>>> [ 103.624690] ? amdgpu_register_gpu_instance+0xe0/0xe0 [amdgpu]
>>>>>> [ 103.624756] ? drm_dev_register+0x19c/0x310 [drm]
>>>>>> [ 103.624768] ? __kasan_slab_free+0x133/0x160
>>>>>> [ 103.624849] drm_dev_register+0x1f5/0x310 [drm]
>>>>>> [ 103.625212] amdgpu_pci_probe+0x109/0x1f0 [amdgpu]
>>>>>> [ 103.625565] ? amdgpu_pmops_runtime_idle+0xe0/0xe0 [amdgpu]
>>>>>> [ 103.625580] local_pci_probe+0x74/0xd0
>>>>>> [ 103.625603] pci_device_probe+0x1fa/0x310
>>>>>> [ 103.625620] ? pci_device_remove+0x1c0/0x1c0
>>>>>> [ 103.625640] ? sysfs_do_create_link_sd.isra.2+0x74/0xe0
>>>>>> [ 103.625673] really_probe+0x367/0x5d0
>>>>>> [ 103.625700] driver_probe_device+0x177/0x1b0
>>>>>> [ 103.625721] device_driver_attach+0x8a/0x90
>>>>>> [ 103.625737] ? device_driver_attach+0x90/0x90
>>>>>> [ 103.625746] __driver_attach+0xeb/0x190
>>>>>> [ 103.625765] ? device_driver_attach+0x90/0x90
>>>>>> [ 103.625773] bus_for_each_dev+0xe4/0x160
>>>>>> [ 103.625789] ? subsys_dev_iter_exit+0x10/0x10
>>>>>> [ 103.625829] bus_add_driver+0x277/0x330
>>>>>> [ 103.625855] driver_register+0xc6/0x1a0
>>>>>> [ 103.625866] ? 0xffffffffa0d88000
>>>>>> [ 103.625880] do_one_initcall+0xd3/0x334
>>>>>> [ 103.625895] ? trace_event_raw_event_initcall_finish+0x150/0x150
>>>>>> [ 103.625911] ? kasan_unpoison_shadow+0x31/0x40
>>>>>> [ 103.625924] ? __kasan_kmalloc+0xd5/0xf0
>>>>>> [ 103.625946] ? kmem_cache_alloc_trace+0x154/0x300
>>>>>> [ 103.625955] ? kasan_unpoison_shadow+0x31/0x40
>>>>>> [ 103.625985] do_init_module+0xec/0x354
>>>>>> [ 103.626011] load_module+0x3c91/0x4980
>>>>>> [ 103.626118] ? module_frob_arch_sections+0x20/0x20
>>>>>> [ 103.626132] ? ima_read_file+0x10/0x10
>>>>>> [ 103.626142] ? vfs_read+0x127/0x190
>>>>>> [ 103.626163] ? kernel_read+0x95/0xb0
>>>>>> [ 103.626187] ? kernel_read_file+0x1a5/0x340
>>>>>> [ 103.626277] ? __do_sys_finit_module+0x175/0x1b0
>>>>>> [ 103.626287] __do_sys_finit_module+0x175/0x1b0
>>>>>> [ 103.626301] ? __ia32_sys_init_module+0x40/0x40
>>>>>> [ 103.626338] ? lock_downgrade+0x390/0x390
>>>>>> [ 103.626396] ? vtime_user_exit+0xc8/0xe0
>>>>>> [ 103.626423] do_syscall_64+0x7d/0x250
>>>>>> [ 103.626440] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>> [ 103.626450] RIP: 0033:0x7f09984854d9
>>>>>> [ 103.626461] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 29 2c 00 f7 d8 64 89 01 48
>>>>>> [ 103.626468] RSP: 002b:00007ffc42896008 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>>>> [ 103.626479] RAX: ffffffffffffffda RBX: 0000559a52495400 RCX: 00007f09984854d9
>>>>>> [ 103.626486] RDX: 0000000000000000 RSI: 0000559a52499900 RDI: 0000000000000006
>>>>>> [ 103.626493] RBP: 0000559a52499900 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [ 103.626500] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
>>>>>> [ 103.626508] R13: 0000559a52499b30 R14: 0000000000040000 R15: 0000000000000013
>>>>>>
>>>>>> [ 103.626592] The buggy address belongs to the page:
>>>>>> [ 103.626665] page:ffffea000f2c6780 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>> [ 103.626675] flags: 0x2ffff0000000000()
>>>>>> [ 103.626686] raw: 02ffff0000000000 0000000000000000 ffffea000f2c6788 0000000000000000
>>>>>> [ 103.626696] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>>>>>> [ 103.626702] page dumped because: kasan: bad access detected
>>>>>>
>>>>>> [ 103.626742] addr ffff8883cb19ee38 is located in stack of task modprobe/1122 at offset 264 in frame:
>>>>>> [ 103.627233] kfd_create_vcrat_image_gpu+0x0/0xb80 [amdgpu]
>>>>>>
>>>>>> [ 103.627346] this frame has 3 objects:
>>>>>> [ 103.627405] [32, 36) 'avail_size'
>>>>>> [ 103.627410] [96, 120) 'local_mem_info'
>>>>>> [ 103.627466] [160, 264) 'cu_info'
>>>>>>
>>>>>> [ 103.627602] Memory state around the buggy address:
>>>>>> [ 103.627675]  ffff8883cb19ed00: 00 00 00 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f2 f2
>>>>>> [ 103.627780]  ffff8883cb19ed80: f2 f2 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00
>>>>>> [ 103.627885] >ffff8883cb19ee00: 00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00
>>>>>> [ 103.627989]                                         ^
>>>>>> [ 103.628065]  ffff8883cb19ee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>>>> [ 103.628169]  ffff8883cb19ef00: f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00 00
>>>>>> [ 103.628273] ==================================================================
>>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
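
For anyone reading the KASAN report quoted above, here is a small sketch of how its frame layout and shadow dump can be cross-checked. It is only an illustration, not part of the driver or of KASAN itself: the constants are copied from the report, the helper names `classify` and `shadow_byte` are made up for this note, and the 8-bytes-per-shadow-byte granule is an assumption that holds for generic KASAN.

```python
# Values copied from the KASAN report quoted in this thread.
BUGGY_ADDR = 0xFFFF8883CB19EE38   # "Read of size 4 at addr ..."
FRAME_OFFSET = 264                # "at offset 264 in frame"
FRAME_OBJECTS = [                 # half-open [start, end) ranges within the frame
    (32, 36, "avail_size"),
    (96, 120, "local_mem_info"),
    (160, 264, "cu_info"),
]
# The shadow row that the report marks with '>'.
SHADOW_ROW_ADDR = 0xFFFF8883CB19EE00
SHADOW_ROW = "00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00".split()
GRANULE = 8  # assumption: generic KASAN, one shadow byte per 8 bytes of memory


def classify(offset, objects):
    """Describe where a frame offset falls relative to the listed objects."""
    for start, end, name in objects:
        if start <= offset < end:
            return f"inside '{name}'"
    preceding = [obj for obj in objects if obj[1] <= offset]
    if not preceding:
        return "before the first object"
    _, end, name = max(preceding, key=lambda obj: obj[1])
    return f"{offset - end} byte(s) past the end of '{name}'"


def shadow_byte(addr, row_addr, row):
    """Return the shadow byte in the given row that covers addr."""
    return row[(addr - row_addr) // GRANULE]


if __name__ == "__main__":
    print(classify(FRAME_OFFSET, FRAME_OBJECTS))
    print(shadow_byte(BUGGY_ADDR, SHADOW_ROW_ADDR, SHADOW_ROW))
```

Under those assumptions, the 4-byte read starts exactly at the end of 'cu_info' (offset 264 is the first byte past its [160, 264) range), and the shadow byte covering the address is the first non-zero one in the marked row. That suggests an access just past the end of 'cu_info' rather than a stack that is too small overall, though the report alone does not say which line of code performs the read.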