Re: Stack out of bounds in KFD on Arcturus

I don't see the problem on Vega 20.

Andrey

On 10/22/19 1:46 PM, Zeng, Oak wrote:
> Sorry, I searched my Kconfig and could not find the stack size option anymore. Maybe the kernel stack size is no longer configurable.
>
> Can you try your kernel on Vega 10, Vega 20, or Navi 10? We want to know whether this is an MI100-specific issue.
>
> Oak
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: Tuesday, October 22, 2019 1:28 PM
> To: Zeng, Oak <Oak.Zeng@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> I don't know. Which Kconfig flag should I look at?
>
> Andrey
>
> On 10/22/19 1:17 PM, Zeng, Oak wrote:
>> Sorry, I meant: is the kernel stack size 16 KB in your Kconfig?
>>
>> Oak
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>> Sent: Tuesday, October 22, 2019 12:49 PM
>> To: Zeng, Oak <Oak.Zeng@xxxxxxx>; Kuehling, Felix
>> <Felix.Kuehling@xxxxxxx>
>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>
>> On 10/18/19 5:31 PM, Zeng, Oak wrote:
>>
>>> Hi Andrey,
>>>
>>> What is your system configuration? I haven't seen this issue before. Also see the attached QA configuration; you can compare the two for differences.
>> Attached is my lshw output.
>>
>>> Also, I believe the default kernel stack size for x86-64 is 16 KB. Is that what your Kconfig has?
>> What do you mean by asking whether this is my Kconfig? Is there a particular Kconfig flag I should look for?
>>
>> Andrey
>>
>>
>>> Regards,
>>> Oak
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>> Kuehling, Felix
>>> Sent: Friday, October 18, 2019 4:55 PM
>>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>>
>>> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>>>> Not that I am aware of. Is there a special Kconfig flag that
>>>> determines the stack size?
>>> I remember there used to be a Kconfig option to force a 4KB kernel stack. I don't see it in the current kernel any more.
>>>
>>> I don't have time to work on this myself. I'll create a ticket and see if I can find someone to investigate.
>>>
>>> Thanks,
>>>       Felix
>>>
>>>
>>>> Andrey
>>>>
>>>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
>>>>> I don't see why this problem would be specific to Arcturus. I don't
>>>>> see any excessive allocations on the stack either. Also the code
>>>>> involved here hasn't changed recently.
>>>>>
>>>>> Are you using some weird kernel config with a smaller stack? Is it
>>>>> specific to a compiler version or some optimization flags? I've
>>>>> sometimes seen function inlining cause excessive stack usage.
>>>>>
>>>>> Regards,
>>>>>         Felix
>>>>>
>>>>> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
>>>>>> Hi Felix - I see this on boot when working with Arcturus.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> [  103.602092] kfd kfd: Allocated 3969056 bytes on gart
>>>>>> [  103.610769] ==================================================================
>>>>>> [  103.611469] BUG: KASAN: stack-out-of-bounds in kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [  103.611646] Read of size 4 at addr ffff8883cb19ee38 by task modprobe/1122
>>>>>>
>>>>>> [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O 5.3.0-rc3+ #45
>>>>>> [  103.611847] Hardware name: System manufacturer System Product Name/Z170-PRO, BIOS 1902 06/27/2016
>>>>>> [  103.611856] Call Trace:
>>>>>> [  103.611879]  dump_stack+0x71/0xab
>>>>>> [  103.611907]  print_address_description+0x1da/0x3c0
>>>>>> [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [  103.612479]  __kasan_report+0x13f/0x1a0
>>>>>> [  103.613022]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [  103.613580]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [  103.613604]  kasan_report+0xe/0x20
>>>>>> [  103.614149]  kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>>> [  103.614762]  ? kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu]
>>>>>> [  103.614796]  ? __alloc_pages_nodemask+0x2c9/0x560
>>>>>> [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
>>>>>> [  103.614898]  ? kmalloc_order+0x63/0x70
>>>>>> [  103.615469]  kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu]
>>>>>> [  103.616054]  ? kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu]
>>>>>> [  103.616095]  ? up_write+0x4b/0x70
>>>>>> [  103.616649]  kfd_topology_add_device+0x98d/0xb10 [amdgpu]
>>>>>> [  103.617207]  ? kfd_topology_shutdown+0x60/0x60 [amdgpu]
>>>>>> [  103.617743]  ? start_cpsch+0x2ff/0x3a0 [amdgpu]
>>>>>> [  103.617777]  ? mutex_lock_io_nested+0xac0/0xac0
>>>>>> [  103.617807]  ? __mutex_unlock_slowpath+0xda/0x420
>>>>>> [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
>>>>>> [  103.617877]  ? wait_for_completion+0x200/0x200
>>>>>> [  103.618461]  ? start_cpsch+0x38b/0x3a0 [amdgpu]
>>>>>> [  103.619011]  ? create_queue_cpsch+0x670/0x670 [amdgpu]
>>>>>> [  103.619573]  ? kfd_iommu_device_init+0x92/0x1e0 [amdgpu]
>>>>>> [  103.620112]  ? kfd_iommu_resume+0x2c/0x2c0 [amdgpu]
>>>>>> [  103.620655]  ? kfd_iommu_check_device+0xf0/0xf0 [amdgpu]
>>>>>> [  103.621228]  kgd2kfd_device_init+0x474/0x870 [amdgpu]
>>>>>> [  103.621781]  amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu]
>>>>>> [  103.622329]  ? amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu]
>>>>>> [  103.622344]  ? kmsg_dump_rewind_nolock+0x59/0x59
>>>>>> [  103.622895]  ? amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu]
>>>>>> [  103.623424]  amdgpu_device_init+0x1bbe/0x2f00 [amdgpu]
>>>>>> [  103.623819]  ? amdgpu_device_has_dc_support+0x30/0x30 [amdgpu]
>>>>>> [  103.623842]  ? __isolate_free_page+0x290/0x290
>>>>>> [  103.623852]  ? fs_reclaim_acquire.part.97+0x5/0x30
>>>>>> [  103.623891]  ? __alloc_pages_nodemask+0x2c9/0x560
>>>>>> [  103.623912]  ? __alloc_pages_slowpath+0x1390/0x1390
>>>>>> [  103.623945]  ? kasan_unpoison_shadow+0x31/0x40
>>>>>> [  103.623970]  ? kmalloc_order+0x63/0x70
>>>>>> [  103.624337]  amdgpu_driver_load_kms+0xd9/0x430 [amdgpu]
>>>>>> [  103.624690]  ? amdgpu_register_gpu_instance+0xe0/0xe0 [amdgpu]
>>>>>> [  103.624756]  ? drm_dev_register+0x19c/0x310 [drm]
>>>>>> [  103.624768]  ? __kasan_slab_free+0x133/0x160
>>>>>> [  103.624849]  drm_dev_register+0x1f5/0x310 [drm]
>>>>>> [  103.625212]  amdgpu_pci_probe+0x109/0x1f0 [amdgpu]
>>>>>> [  103.625565]  ? amdgpu_pmops_runtime_idle+0xe0/0xe0 [amdgpu]
>>>>>> [  103.625580]  local_pci_probe+0x74/0xd0
>>>>>> [  103.625603]  pci_device_probe+0x1fa/0x310
>>>>>> [  103.625620]  ? pci_device_remove+0x1c0/0x1c0
>>>>>> [  103.625640]  ? sysfs_do_create_link_sd.isra.2+0x74/0xe0
>>>>>> [  103.625673]  really_probe+0x367/0x5d0
>>>>>> [  103.625700]  driver_probe_device+0x177/0x1b0
>>>>>> [  103.625721]  device_driver_attach+0x8a/0x90
>>>>>> [  103.625737]  ? device_driver_attach+0x90/0x90
>>>>>> [  103.625746]  __driver_attach+0xeb/0x190
>>>>>> [  103.625765]  ? device_driver_attach+0x90/0x90
>>>>>> [  103.625773]  bus_for_each_dev+0xe4/0x160
>>>>>> [  103.625789]  ? subsys_dev_iter_exit+0x10/0x10
>>>>>> [  103.625829]  bus_add_driver+0x277/0x330
>>>>>> [  103.625855]  driver_register+0xc6/0x1a0
>>>>>> [  103.625866]  ? 0xffffffffa0d88000
>>>>>> [  103.625880]  do_one_initcall+0xd3/0x334
>>>>>> [  103.625895]  ? trace_event_raw_event_initcall_finish+0x150/0x150
>>>>>> [  103.625911]  ? kasan_unpoison_shadow+0x31/0x40
>>>>>> [  103.625924]  ? __kasan_kmalloc+0xd5/0xf0
>>>>>> [  103.625946]  ? kmem_cache_alloc_trace+0x154/0x300
>>>>>> [  103.625955]  ? kasan_unpoison_shadow+0x31/0x40
>>>>>> [  103.625985]  do_init_module+0xec/0x354
>>>>>> [  103.626011]  load_module+0x3c91/0x4980
>>>>>> [  103.626118]  ? module_frob_arch_sections+0x20/0x20
>>>>>> [  103.626132]  ? ima_read_file+0x10/0x10
>>>>>> [  103.626142]  ? vfs_read+0x127/0x190
>>>>>> [  103.626163]  ? kernel_read+0x95/0xb0
>>>>>> [  103.626187]  ? kernel_read_file+0x1a5/0x340
>>>>>> [  103.626277]  ? __do_sys_finit_module+0x175/0x1b0
>>>>>> [  103.626287]  __do_sys_finit_module+0x175/0x1b0
>>>>>> [  103.626301]  ? __ia32_sys_init_module+0x40/0x40
>>>>>> [  103.626338]  ? lock_downgrade+0x390/0x390
>>>>>> [  103.626396]  ? vtime_user_exit+0xc8/0xe0
>>>>>> [  103.626423]  do_syscall_64+0x7d/0x250
>>>>>> [  103.626440]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>> [  103.626450] RIP: 0033:0x7f09984854d9
>>>>>> [  103.626461] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 29 2c 00 f7 d8 64 89 01 48
>>>>>> [  103.626468] RSP: 002b:00007ffc42896008 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>>>> [  103.626479] RAX: ffffffffffffffda RBX: 0000559a52495400 RCX: 00007f09984854d9
>>>>>> [  103.626486] RDX: 0000000000000000 RSI: 0000559a52499900 RDI: 0000000000000006
>>>>>> [  103.626493] RBP: 0000559a52499900 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [  103.626500] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
>>>>>> [  103.626508] R13: 0000559a52499b30 R14: 0000000000040000 R15: 0000000000000013
>>>>>>
>>>>>> [  103.626592] The buggy address belongs to the page:
>>>>>> [  103.626665] page:ffffea000f2c6780 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>> [  103.626675] flags: 0x2ffff0000000000()
>>>>>> [  103.626686] raw: 02ffff0000000000 0000000000000000 ffffea000f2c6788 0000000000000000
>>>>>> [  103.626696] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>>>>>> [  103.626702] page dumped because: kasan: bad access detected
>>>>>>
>>>>>> [  103.626742] addr ffff8883cb19ee38 is located in stack of task modprobe/1122 at offset 264 in frame:
>>>>>> [  103.627233]  kfd_create_vcrat_image_gpu+0x0/0xb80 [amdgpu]
>>>>>>
>>>>>> [  103.627346] this frame has 3 objects:
>>>>>> [  103.627405]  [32, 36) 'avail_size'
>>>>>> [  103.627410]  [96, 120) 'local_mem_info'
>>>>>> [  103.627466]  [160, 264) 'cu_info'
>>>>>>
>>>>>> [  103.627602] Memory state around the buggy address:
>>>>>> [  103.627675]  ffff8883cb19ed00: 00 00 00 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f2 f2
>>>>>> [  103.627780]  ffff8883cb19ed80: f2 f2 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00
>>>>>> [  103.627885] >ffff8883cb19ee00: 00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00
>>>>>> [  103.627989]                                         ^
>>>>>> [  103.628065]  ffff8883cb19ee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>>>> [  103.628169]  ffff8883cb19ef00: f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00 00
>>>>>> [  103.628273] ==================================================================
>>>>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
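
The frame-layout numbers in the KASAN report quoted above can be cross-checked with a few lines of Python. This is only a sketch built from values copied out of the log. It assumes generic KASAN with one shadow byte per 8 bytes of memory, which matches the 0x80 step between the dumped rows. The helper names (shadow_byte, nearest_object) are made up for illustration and are not kernel APIs.

```python
# Values copied from the KASAN report quoted above.
FAULT_ADDR = 0xFFFF8883CB19EE38   # "Read of size 4 at addr ..."
FRAME_OFFSET = 264                # "at offset 264 in frame"
FRAME_OBJECTS = {                 # name -> [start, end) offsets in the frame
    "avail_size": (32, 36),
    "local_mem_info": (96, 120),
    "cu_info": (160, 264),
}
# Shadow rows from "Memory state around the buggy address".
SHADOW_ROWS = {
    0xFFFF8883CB19ED00: "00 00 00 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f2 f2",
    0xFFFF8883CB19ED80: "f2 f2 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00",
    0xFFFF8883CB19EE00: "00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00",
}
# Assumption: generic KASAN, one shadow byte per 8 bytes of memory.
SHADOW_SCALE = 8
ROW_SPAN = 16 * SHADOW_SCALE      # 16 shadow bytes per row -> 128 bytes


def shadow_byte(addr):
    """Look up the shadow byte that covers addr in the dumped rows."""
    row = addr & ~(ROW_SPAN - 1)
    index = (addr - row) // SHADOW_SCALE
    return int(SHADOW_ROWS[row].split()[index], 16)


def nearest_object(offset):
    """Return (name, distance) for the object that ends closest at or before offset."""
    ended = [(end, name) for name, (_, end) in FRAME_OBJECTS.items() if end <= offset]
    end, name = max(ended)
    return name, offset - end


frame_base = FAULT_ADDR - FRAME_OFFSET
name, past_end = nearest_object(FRAME_OFFSET)

# Shadow bytes covering cu_info itself, from its start up to the faulting address.
cu_start = frame_base + FRAME_OBJECTS["cu_info"][0]
cu_shadow = [shadow_byte(a) for a in range(cu_start, FAULT_ADDR, SHADOW_SCALE)]

print(f"frame base: {frame_base:#x}")
print(f"access starts {past_end} bytes past the end of '{name}'")
print(f"shadow byte at fault: {shadow_byte(FAULT_ADDR):#04x}")
print(f"cu_info fully addressable: {all(b == 0 for b in cu_shadow)}")
```

With the values from the log, the 4-byte read starts at the first byte after cu_info, and the shadow byte covering it is a non-zero (poisoned) value. That looks more like a read just past the end of cu_info than an exhausted stack, though the log alone does not confirm it.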
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
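
On the stack-size question raised in the thread: as far as I recall, x86-64 no longer has a Kconfig knob for this. The size comes from THREAD_SIZE_ORDER in the arch headers, which is 2 (16 KB) and grows by one (32 KB) when CONFIG_KASAN is enabled. Please check that against your own tree. The sketch below parses a kernel .config and applies that rule. The sample config text and the function names are hypothetical, and CONFIG_FRAME_WARN is printed only because it controls the compiler's per-function frame-size warning.

```python
PAGE_SIZE = 4096  # x86-64 base page size


def parse_kconfig(text):
    """Parse 'CONFIG_FOO=value' lines; comment lines such as
    '# CONFIG_FOO is not set' are skipped."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value.strip('"')
    return options


def x86_64_thread_size(options):
    """Assumed rule: THREAD_SIZE = PAGE_SIZE << (2 + KASAN_STACK_ORDER),
    where KASAN_STACK_ORDER is 1 with CONFIG_KASAN=y and 0 otherwise."""
    kasan_order = 1 if options.get("CONFIG_KASAN") == "y" else 0
    return PAGE_SIZE << (2 + kasan_order)


# Hypothetical config excerpt; in practice read the real file,
# for example the .config in the kernel build directory.
SAMPLE_CONFIG = """\
CONFIG_X86_64=y
CONFIG_KASAN=y
# CONFIG_VMAP_STACK is not set
CONFIG_FRAME_WARN=2048
"""

options = parse_kconfig(SAMPLE_CONFIG)
print("expected kernel stack size:", x86_64_thread_size(options), "bytes")
print("frame-size warning limit:", options.get("CONFIG_FRAME_WARN", "unset"))
```

Whether KASAN is enabled is the only input the size rule uses here, so a KASAN build like the one in the report would be expected to have a larger stack than a default x86-64 build.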



