RE: Stack out of bounds in KFD on Arcturus

Hi Andrey, 

What is your system configuration? I haven't seen this issue before. Please also see the attached QA configuration; you can compare it with yours to spot any differences.

Also, I believe the default kernel stack size on x86-64 is 16 KB. Is that what your Kconfig gives you?
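
For reference, a rough sketch of how the x86-64 kernel stack size follows from the config. It assumes the definitions in arch/x86/include/asm/page_64_types.h for kernels of this era (THREAD_SIZE = PAGE_SIZE << THREAD_SIZE_ORDER, with THREAD_SIZE_ORDER = 2 + KASAN_STACK_ORDER, and 4 KB pages). The helper name and the sample config text are hypothetical and are not taken from the attached configuration.

```python
PAGE_SIZE = 4096  # assumption: 4 KB base pages on x86-64


def x86_64_thread_size(config_text: str) -> int:
    """Return the kernel stack size in bytes implied by a .config text.

    Hypothetical helper: it mirrors the THREAD_SIZE_ORDER arithmetic from
    page_64_types.h as understood here; it does not read a running kernel.
    """
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_KASAN is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)

    kasan_stack_order = 1 if "CONFIG_KASAN" in enabled else 0
    thread_size_order = 2 + kasan_stack_order
    return PAGE_SIZE << thread_size_order


# Made-up config fragments for illustration only.
without_kasan = "CONFIG_X86_64=y\n# CONFIG_KASAN is not set\n"
with_kasan = "CONFIG_X86_64=y\nCONFIG_KASAN=y\n"

print(x86_64_thread_size(without_kasan))  # 16384
print(x86_64_thread_size(with_kasan))     # 32768
```

Under those assumptions a KASAN build, like the one that produced the report quoted below, runs with 32 KB stacks rather than 16 KB.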

Regards,
Oak

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Kuehling, Felix
Sent: Friday, October 18, 2019 4:55 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: Stack out of bounds in KFD on Arcturus

On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
> Not that I am aware of. Is there a special Kconfig flag that determines
> the stack size?

I remember there used to be a Kconfig option to force a 4KB kernel stack. I don't see it in the current kernel any more.

I don't have time to work on this myself. I'll create a ticket and see if I can find someone to investigate.

Thanks,
   Felix


>
> Andrey
>
> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
>> I don't see why this problem would be specific to Arcturus. I don't 
>> see any excessive allocations on the stack either. Also the code 
>> involved here hasn't changed recently.
>>
>> Are you using some weird kernel config with a smaller stack? Is it 
>> specific to a compiler version or some optimization flags? I've 
>> sometimes seen function inlining cause excessive stack usage.
>>
>> Regards,
>>      Felix
>>
>> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
>>> Hi Felix - I see this on boot when working with Arcturus.
>>>
>>> Andrey
>>>
>>>
>>> [  103.602092] kfd kfd: Allocated 3969056 bytes on gart
>>> [  103.610769] ==================================================================
>>> [  103.611469] BUG: KASAN: stack-out-of-bounds in kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>> [  103.611646] Read of size 4 at addr ffff8883cb19ee38 by task modprobe/1122
>>>
>>> [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O      5.3.0-rc3+ #45
>>> [  103.611847] Hardware name: System manufacturer System Product Name/Z170-PRO, BIOS 1902 06/27/2016
>>> [  103.611856] Call Trace:
>>> [  103.611879]  dump_stack+0x71/0xab
>>> [  103.611907]  print_address_description+0x1da/0x3c0
>>> [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>> [  103.612479]  __kasan_report+0x13f/0x1a0
>>> [  103.613022]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>> [  103.613580]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>> [  103.613604]  kasan_report+0xe/0x20
>>> [  103.614149]  kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>> [  103.614762]  ? kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu]
>>> [  103.614796]  ? __alloc_pages_nodemask+0x2c9/0x560
>>> [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
>>> [  103.614898]  ? kmalloc_order+0x63/0x70
>>> [  103.615469]  kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu]
>>> [  103.616054]  ? kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu]
>>> [  103.616095]  ? up_write+0x4b/0x70
>>> [  103.616649]  kfd_topology_add_device+0x98d/0xb10 [amdgpu]
>>> [  103.617207]  ? kfd_topology_shutdown+0x60/0x60 [amdgpu]
>>> [  103.617743]  ? start_cpsch+0x2ff/0x3a0 [amdgpu]
>>> [  103.617777]  ? mutex_lock_io_nested+0xac0/0xac0
>>> [  103.617807]  ? __mutex_unlock_slowpath+0xda/0x420
>>> [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
>>> [  103.617877]  ? wait_for_completion+0x200/0x200
>>> [  103.618461]  ? start_cpsch+0x38b/0x3a0 [amdgpu]
>>> [  103.619011]  ? create_queue_cpsch+0x670/0x670 [amdgpu]
>>> [  103.619573]  ? kfd_iommu_device_init+0x92/0x1e0 [amdgpu]
>>> [  103.620112]  ? kfd_iommu_resume+0x2c/0x2c0 [amdgpu]
>>> [  103.620655]  ? kfd_iommu_check_device+0xf0/0xf0 [amdgpu]
>>> [  103.621228]  kgd2kfd_device_init+0x474/0x870 [amdgpu]
>>> [  103.621781]  amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu]
>>> [  103.622329]  ? amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu]
>>> [  103.622344]  ? kmsg_dump_rewind_nolock+0x59/0x59
>>> [  103.622895]  ? amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu]
>>> [  103.623424]  amdgpu_device_init+0x1bbe/0x2f00 [amdgpu]
>>> [  103.623819]  ? amdgpu_device_has_dc_support+0x30/0x30 [amdgpu]
>>> [  103.623842]  ? __isolate_free_page+0x290/0x290
>>> [  103.623852]  ? fs_reclaim_acquire.part.97+0x5/0x30
>>> [  103.623891]  ? __alloc_pages_nodemask+0x2c9/0x560
>>> [  103.623912]  ? __alloc_pages_slowpath+0x1390/0x1390
>>> [  103.623945]  ? kasan_unpoison_shadow+0x31/0x40
>>> [  103.623970]  ? kmalloc_order+0x63/0x70
>>> [  103.624337]  amdgpu_driver_load_kms+0xd9/0x430 [amdgpu]
>>> [  103.624690]  ? amdgpu_register_gpu_instance+0xe0/0xe0 [amdgpu]
>>> [  103.624756]  ? drm_dev_register+0x19c/0x310 [drm]
>>> [  103.624768]  ? __kasan_slab_free+0x133/0x160
>>> [  103.624849]  drm_dev_register+0x1f5/0x310 [drm]
>>> [  103.625212]  amdgpu_pci_probe+0x109/0x1f0 [amdgpu]
>>> [  103.625565]  ? amdgpu_pmops_runtime_idle+0xe0/0xe0 [amdgpu]
>>> [  103.625580]  local_pci_probe+0x74/0xd0
>>> [  103.625603]  pci_device_probe+0x1fa/0x310
>>> [  103.625620]  ? pci_device_remove+0x1c0/0x1c0
>>> [  103.625640]  ? sysfs_do_create_link_sd.isra.2+0x74/0xe0
>>> [  103.625673]  really_probe+0x367/0x5d0
>>> [  103.625700]  driver_probe_device+0x177/0x1b0
>>> [  103.625721]  device_driver_attach+0x8a/0x90
>>> [  103.625737]  ? device_driver_attach+0x90/0x90
>>> [  103.625746]  __driver_attach+0xeb/0x190
>>> [  103.625765]  ? device_driver_attach+0x90/0x90
>>> [  103.625773]  bus_for_each_dev+0xe4/0x160
>>> [  103.625789]  ? subsys_dev_iter_exit+0x10/0x10
>>> [  103.625829]  bus_add_driver+0x277/0x330
>>> [  103.625855]  driver_register+0xc6/0x1a0
>>> [  103.625866]  ? 0xffffffffa0d88000
>>> [  103.625880]  do_one_initcall+0xd3/0x334
>>> [  103.625895]  ? trace_event_raw_event_initcall_finish+0x150/0x150
>>> [  103.625911]  ? kasan_unpoison_shadow+0x31/0x40
>>> [  103.625924]  ? __kasan_kmalloc+0xd5/0xf0
>>> [  103.625946]  ? kmem_cache_alloc_trace+0x154/0x300
>>> [  103.625955]  ? kasan_unpoison_shadow+0x31/0x40
>>> [  103.625985]  do_init_module+0xec/0x354
>>> [  103.626011]  load_module+0x3c91/0x4980
>>> [  103.626118]  ? module_frob_arch_sections+0x20/0x20
>>> [  103.626132]  ? ima_read_file+0x10/0x10
>>> [  103.626142]  ? vfs_read+0x127/0x190
>>> [  103.626163]  ? kernel_read+0x95/0xb0
>>> [  103.626187]  ? kernel_read_file+0x1a5/0x340
>>> [  103.626277]  ? __do_sys_finit_module+0x175/0x1b0
>>> [  103.626287]  __do_sys_finit_module+0x175/0x1b0
>>> [  103.626301]  ? __ia32_sys_init_module+0x40/0x40
>>> [  103.626338]  ? lock_downgrade+0x390/0x390
>>> [  103.626396]  ? vtime_user_exit+0xc8/0xe0
>>> [  103.626423]  do_syscall_64+0x7d/0x250
>>> [  103.626440]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  103.626450] RIP: 0033:0x7f09984854d9
>>> [  103.626461] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 29 2c 00 f7 d8 64 89 01 48
>>> [  103.626468] RSP: 002b:00007ffc42896008 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>> [  103.626479] RAX: ffffffffffffffda RBX: 0000559a52495400 RCX: 00007f09984854d9
>>> [  103.626486] RDX: 0000000000000000 RSI: 0000559a52499900 RDI: 0000000000000006
>>> [  103.626493] RBP: 0000559a52499900 R08: 0000000000000000 R09: 0000000000000000
>>> [  103.626500] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
>>> [  103.626508] R13: 0000559a52499b30 R14: 0000000000040000 R15: 0000000000000013
>>>
>>> [  103.626592] The buggy address belongs to the page:
>>> [  103.626665] page:ffffea000f2c6780 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
>>> [  103.626675] flags: 0x2ffff0000000000()
>>> [  103.626686] raw: 02ffff0000000000 0000000000000000 ffffea000f2c6788 0000000000000000
>>> [  103.626696] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>>> [  103.626702] page dumped because: kasan: bad access detected
>>>
>>> [  103.626742] addr ffff8883cb19ee38 is located in stack of task modprobe/1122 at offset 264 in frame:
>>> [  103.627233]  kfd_create_vcrat_image_gpu+0x0/0xb80 [amdgpu]
>>>
>>> [  103.627346] this frame has 3 objects:
>>> [  103.627405]  [32, 36) 'avail_size'
>>> [  103.627410]  [96, 120) 'local_mem_info'
>>> [  103.627466]  [160, 264) 'cu_info'
>>>
>>> [  103.627602] Memory state around the buggy address:
>>> [  103.627675]  ffff8883cb19ed00: 00 00 00 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f2 f2
>>> [  103.627780]  ffff8883cb19ed80: f2 f2 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00
>>> [  103.627885] >ffff8883cb19ee00: 00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00
>>> [  103.627989]                                          ^
>>> [  103.628065]  ffff8883cb19ee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> [  103.628169]  ffff8883cb19ef00: f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00 00
>>> [  103.628273] ==================================================================
>>>
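
The frame layout and shadow dump in the report above can be cross-checked with a short sketch. It only re-reads numbers printed in the log (object ranges, the faulting offset, the marked shadow row). The one-shadow-byte-per-8-bytes scale and the marker names are assumptions based on mm/kasan/kasan.h in kernels of this era, and the helper names are hypothetical.

```python
SHADOW_SCALE = 8  # assumption: one KASAN shadow byte covers 8 bytes of memory

# Assumed stack shadow markers (per mm/kasan/kasan.h around v5.3).
STACK_MARKERS = {
    0xF1: "stack left redzone",
    0xF2: "stack mid redzone",
    0xF3: "stack right redzone",
    0xF4: "stack partial redzone",
}

# "this frame has 3 objects" from the report: [start, end) offsets and names.
FRAME_OBJECTS = [
    (32, 36, "avail_size"),
    (96, 120, "local_mem_info"),
    (160, 264, "cu_info"),
]


def classify_offset(offset, objects):
    """Describe where a frame offset falls relative to the listed objects."""
    for start, end, name in objects:
        if start <= offset < end:
            return f"inside '{name}' at byte {offset - start}"
    below = [(end, name) for _, end, name in objects if end <= offset]
    if below:
        end, name = max(below)  # nearest object ending at or before the offset
        return f"{offset - end} bytes past the end of '{name}'"
    return "before the first object"


def shadow_byte(row_base, row_bytes, addr):
    """Return the shadow byte that describes addr within one dumped row."""
    return row_bytes[(addr - row_base) // SHADOW_SCALE]


# Row marked with '>' in the report, and the faulting address.
row_base = 0xFFFF8883CB19EE00
row = bytes.fromhex("00000000000000f4f4f4f3f3f3f30000")
addr = 0xFFFF8883CB19EE38

print(classify_offset(264, FRAME_OBJECTS))  # 0 bytes past the end of 'cu_info'
marker = shadow_byte(row_base, row, addr)
print(hex(marker), STACK_MARKERS.get(marker, "addressable or unknown"))
```

Read this way, the 4-byte read starts at the first byte after 'cu_info' (offset 264 is the exclusive end of [160, 264)), and the shadow byte for that address is a redzone marker rather than addressable memory.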
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
--- Begin Message ---

 

MI100 HW Enablement Linux SW Stack(VBIOS, FW, DKMS Kernel Driver) Integration Test Report

Dashboard

Hardware info

AMDGPU Linux Stack

Linux Distro

Status

SUT-1 Configuration:

  • Motherboard: ASUS PRIME Z270-A
  • CPU: i7-7700K CPU @ 4.20GHz
  • Memory: Kingston DDR4 2133 8GB *2
  • ASIC: MI100 socket PA Non-Secure board revB 102-D34101-01

Ubuntu 18.04.3 LTS

PROMOTABLE

SUT-2 Configuration:

  • Motherboard: Supermicro X10DRG-OT (SYS-4028GR-TRT2)
  • CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
  • Memory: Micron DDR4 2667 MT/s 64GB *12
  • ASIC: MI100 102-D34302-00 PCIe Product Board 32GB (U/F) Non-Secure board XGMI 2P

Ubuntu 18.04.3 LTS

PROMOTABLE

SUT-3 Configuration:

  • Motherboard: Supermicro X10DRG-Q (SYS-7048GR-TR)
  • CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  • Memory: Micron DDR4 2133 MT/s 16GB *7
  • ASIC: MI100 102-D34302-00 PCIe Product Board 32GB (U/F) Non-Secure board *2 non-XGMI

Reference: 

MI100 VBIOS: http://home.amd.com/VideoBios/Video%20BIOS%20Releases/SingleASICRelease.asp?AsicName=MI100

ROCm build for MI100: http://rocm-ci/job/compute-rocm-dkms-no-npi/

How to replace kernel driver and FWs: How to install and replace kernel driver and FWs for MI100

Executive Summary

What's Current and New?

  • Outstanding issues:
    • Issue can be observed with VBIOS L18 on XGMI 2P but not on non-XGMI
      • SWDEV-207030 - [MI100] kfdtest subtests failed on XGMI 2P with large bar enabled Opened
  • Existing issues:
    • SWDEV-204604 - [MI100 XGMI] UCLK/SOCCLK/FCLK DPM are still disabled with XGMI enabled Opened
    • SWDEV-201443 - Linux Pro: KFDMemoryTest.BigBufferStressTest fails Assessed
  • VBIOS upgraded to v19
  • RLC FW upgraded to 21.1, SOS FW upgraded to 0x0017002a
  • Power Feature enablement status

Feature                           SMU FW Ready     AMDGPU Kernel Ready
--------------------------------  ---------------  -------------------
DPM_PREFETCHER                    Yes              Yes
DPM_GFXCLK                        Yes              Yes
DPM_UCLK                          Yes              Checking on driver side; SWDEV-204604 Opened
DPM_SOCCLK                        Yes              Checking on driver side; SWDEV-204604 Opened
DPM_FCLK                          Yes              Checking on driver side; SWDEV-204604 Opened
DPM_XGMI                          No               No
DS_GFXCLK                         Yes              Yes
DS_SOCCLK                         Yes              Yes
DS_LCLK                           Yes              Yes. Require ASPM L1 support in Driver and M/B (Under discussion)
DS_FCLK                           Yes              Yes
GFX_ULV                           Yes              Yes
DPM_VCN                           Yes              VCN disabled for PSP front door loading due to the issue: SWDEV-203022 Assessed
RSMU_SMN_CG                       Yes              Yes
WAFL_CG                           No               No
PPT                               Yes              Yes. Depends on PPTable setting to enable 4 PPT (PPTable Not ready) or 1 PPT
TDC                               Yes              Yes
APCC_PLUS                         Yes              Pending on pptable release
VR0HOT                            Yes              Yes
VR1HOT                            No               No
FW_CTF                            Yes              Yes
FAN CONTROL                       Not POR          N/A
THERMAL CONTROL                   Yes              Yes
OUT_OF_BAND_MONITOR               Yes              Yes
TEMP_DEPENDENT_VMIN               Yes              Pending on pptable release
GFX CG                            NOT SMU feature  Yes
HDP CG                            NOT SMU feature  Yes
SDMA CG                           NOT SMU feature  Yes
MMHUB CG                          NOT SMU feature  Yes
UMC CG                            NOT SMU feature  Yes
DF CG                             NOT SMU feature  Yes
ATHUB CG                          NOT SMU feature  Yes
PSP CG                            NOT SMU feature  Checking the readiness
User Mode Stable Power State      NOT SMU feature  Yes
Workload Aware Dynamic Power      Yes              Yes
Management / User Power Control

Test Coverage

Test case                         MI100 GPU  MI100 mGPU          MI100 mGPU           Comments
                                  (D34101)   (D34302*2 XGMI 2P)  (D34302*2 non-XGMI)
--------------------------------  ---------  ------------------  -------------------  --------
Base
  amdgpu_test
    Basic Tests                   PASS       PASS                PASS
    BO Tests                      PASS       PASS                PASS
    VCN Tests                     N/A        N/A                 N/A                  Skipped: VCN IP initialization is skipped after the switch to FW front door loading. SWDEV-203022 Assessed
    VM Tests                      PASS       PASS                PASS
Power
  GFX DPM check                   PASS       PASS                PASS
  Force GFX DPM level check       PASS       PASS                PASS
  GFX ULV check                   PASS       PASS                PASS
  DS GFXCLK check                 PASS       PASS                PASS
  DS SOCCLK check                 PASS       FAIL                PASS                 SWDEV-204604 Opened
  DS FCLK check                   PASS       FAIL                PASS                 SWDEV-204604 Opened
ROCr/KFD
  rocm_info                       PASS       PASS                PASS
  kfdtest                         PASS       FAIL                PASS                 KFDPerformanceTest.P2PBandWidthTest and KFDGraphicsInterop.RegisterForeignDeviceMem failed via XGMI with large BAR enabled; existing issue with large size system memory
  rocrtst                         PASS       PASS                PASS
  rocm_bandwidth_test             PASS       PASS                PASS                 Data path verified with RBT built in rocm no-npi-dkms build#1060; bad performance via XGMI
  rocm-smi                        PASS       PASS                PASS
  rsmitst                         PASS       PASS                PASS
OCL
  ocltst                          PASS       PASS                PASS
HIP
  hipsamples_utils                PASS       PASS                PASS
Frameworks
  Tensorflow
    tf_convolutional_quick_test   PASS       PASS                PASS
  Pytorch unit test
    test_autograd                 PASS       PASS                PASS
    test_nn                       PASS       PASS                PASS
  MIOpen unit test
    MIOpen (HIP)                  PASS       PASS                PASS
    MIOpen (OpenCL)               PASS       PASS                PASS
Math libs
  rocBLAS                         PASS       PASS                PASS                 Run quick tests only
  hipBLAS                         PASS       PASS                PASS

Additional Information

Note: All tests were run with the latest VBIOS/FW/Kernel and the ROCm LKG build.

Defect list

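Each entry lists Key and Summary, then triage assignment, target sw release, and Assignee.

SWDEV-207030  [MI100] kfdtest subtests failed on XGMI 2P with large bar enabled
              triage assignment: VBIOS; target sw release: (none); Assignee: Tao, Cherry

SWDEV-204604  [MI100 XGMI] UCLK/SOCCLK/FCLK DPM are still disabled with XGMI enabled
              triage assignment: Base dGPU Enablement; target sw release: (none); Assignee: Quan, Evan

SWDEV-203022  MI100 VCN engine hangs after FW loading with PSP
              triage assignment: Multimedia; target sw release: Staging-DRM-Next; Assignee: Zhu, James

SWDEV-202188  [MI100] HSA_STATUS_ERROR_OUT_OF_RESOURCES when run rocminfo on Gigabyte Eypc platform
              triage assignment: HSA KFD; target sw release: (none); Assignee: Keely, Sean

SWDEV-201817  [MI100] rocrtst test failed on Gigabyte Eypc platform
              triage assignment: Runtime; target sw release: (none); Assignee: Keely, Sean

SWDEV-200753  [ROCm QA][no-npi-dkms][MI100] XGMI Links not working with 4P/2P
              triage assignment: Base; target sw release: ROC-Master; Assignee: Clements, John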

 

BCC: Rose, Danny <Danny.Rose@xxxxxxx>; dl.MLSE.QA <dl.MLSE.QA@xxxxxxx>; Weyman, Jeff <Jeffrey.Weyman@xxxxxxx>; Fan, Fai <Fai.Fan@xxxxxxx>; Marsan, Luugi <Luugi.Marsan@xxxxxxx>; sw.dl.ERP.LuugiM <sw.dl.ERP.LuugiM@xxxxxxx>; dl.srdc_lnx_mi100 <dl.srdc_lnx_mi100@xxxxxxx>; Tim Writer <Tim.Writer@xxxxxxx>; dl.SRDC_SW_Linux_dev dl.SRDC_SW_Linux_dev@xxxxxxx; Guo, Miaomiao <Miaomiao.Guo@amd.com>; Yao, Yoyo <Yoyo.Yao@amd.com>; Jain, Praveen <Praveen.Jain@amd.com>; Arora, Jitesh <Jitesh.Arora@xxxxxxx>; Zhu, James <James.Zhu@amd.com>; Bridgman, John <John.Bridgman@amd.com>; Islam, Jamin <Jamin.Islam@amd.com>; Koohestani, Ehsan <Ehsan.Koohestani@amd.com>; Wang, Cloud <Cloud.Wang@amd.com>; Gong, Yakov <Yakov.Gong@amd.com>; Yang, Alice (SRDC 3D) <Alice1.Yang@xxxxxxx>; Ma, Sigil <Sigil.Ma@amd.com>; Li, Colin <Colin.Li@amd.com>; Tang, Moon <Moon.Tang@amd.com>; Khan, Irfan <Irfan.Khan@amd.com>; Nasim, Kam <Kam.Nasim@amd.com>; Shavakh, Shadi <Shadi.Shavakh@amd.com>; Lotfi, Khatereh <Khatereh.Lotfi@amd.com>; Feng, Haifeng <Haifeng.Feng@amd.com>; Liang, Ming <Ming.Liang@amd.com>; "Min.Xu2@amd.com"dl.MI100_CTA <dl.MI100_CTA@amd.com>; Chen, Joe <Joe.Chen@amd.com>

 

 

 

 

Thanks,

Candice Li


--- End Message ---