Thanks Emily, this fixed the issue. Verified-By: Kent Russell <kent.russell@xxxxxxx> Kent -----Original Message----- From: Deng, Emily <Emily.Deng@xxxxxxx> Sent: Sunday, June 16, 2019 11:53 PM To: Yang, Philip <Philip.Yang@xxxxxxx>; Russell, Kent <Kent.Russell@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco reset Hi Philip, Could you help to try whether the attachment patch could help with the issue encounter? If it is OK, then I will send this patch out to review. Best wishes Emily Deng >-----Original Message----- >From: Deng, Emily >Sent: Monday, June 17, 2019 10:50 AM >To: Yang, Philip <Philip.Yang@xxxxxxx>; Russell, Kent ><Kent.Russell@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>; amd- >gfx@xxxxxxxxxxxxxxxxxxxxx >Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco >reset > >Hi Philip, > Sorry for introduce this issue for you. From the code, I couldn't >see any issue. And I have tested the code in my Vega10, it is OK. So I >think this is the kfd specific issue, but I couldn't reproduce issue on >my platform. Could you create an ticket, and assign to me, and share me >your platform, so I could debug it and fix it today. > >Best wishes >Emily Deng > >>-----Original Message----- >>From: Yang, Philip <Philip.Yang@xxxxxxx> >>Sent: Friday, June 14, 2019 10:16 PM >>To: Deng, Emily <Emily.Deng@xxxxxxx>; Russell, Kent >><Kent.Russell@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>; amd- >>gfx@xxxxxxxxxxxxxxxxxxxxx >>Subject: Re: [PATCH] drm/amdgpu: Need to set the baco cap before baco >>reset >> >>Hi Emily, >> >>I am not familiar with vbios and driver init part, just based on my >>experience, the patch don't modify amdgpu_get_bios but move >>amdgpu_get_bios to amdgpu_device_ip_early_init from >>amdgpu_device_init, so amdgpu_get_bios is executed earlier. The kernel error message "BUG: >>kernel NULL pointer dereference" means something is not initialized. >>Please review the change. This issue blocks rocm release now. >> >>Regards, >>Philip >> >>On 2019-06-13 11:19 p.m., Deng, Emily wrote: >>> Hi Russell, >>> This patch will read vbios, and parse vbios to get the baco >>> reset feature >>bit. From the call trace, it shows error in " amdgpu_get_bios ", but >>this patch don't modify amdgpu_get_bios, and code before >>amdgpu_get_bios. Please first check why it will has error when read vbios. >>> >>> Best wishes >>> Emily Deng >>> >>> >>> >>>> -----Original Message----- >>>> From: Russell, Kent <Kent.Russell@xxxxxxx> >>>> Sent: Thursday, June 13, 2019 7:11 PM >>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; Deng, Emily >><Emily.Deng@xxxxxxx>; >>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx> >>>> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before >>>> baco reset >>>> >>>> Hi Emily, >>>> >>>> This patch caused a regression on MI25 (Chip 6860, VBIOS >>>> 113-D0513700-001) machines where the driver would not boot. Note >>>> that this was not seen on >>>> Vega10 Frontier (Chip 6863, VBIOS 113-D0501100-X09) or Radeon64 >>>> (Chip 697f). Reverting this patch resolved the issue with no other >>>> work required and was confirmed on all 3 machines. >>>> >>>> Here is the dmesg: >>>> >>>> [ 3.908653] amdgpu 0000:23:00.0: BAR 6: can't assign [??? 0x00000000 >>flags >>>> 0x20000000] (bogus alignment) >>>> [ 3.908692] BUG: kernel NULL pointer dereference, address: >>>> 0000000000000008 >>>> [ 3.908716] #PF: supervisor read access in kernel mode >>>> [ 3.908734] #PF: error_code(0x0000) - not-present page >>>> [ 3.908753] PGD 0 P4D 0 >>>> [ 3.908767] Oops: 0000 [#1] SMP NOPTI >>>> [ 3.909293] CPU: 8 PID: 409 Comm: kworker/8:1 Not tainted 5.2.0-rc1- >kfd- >>>> compute-roc-master-10734 #1 >>>> [ 3.909753] Hardware name: Inventec P47 >>WC2071019001 >>>> /P47 , BIOS 0.64 04/09/2018 >>>> [ 3.910534] Workqueue: events work_for_cpu_fn >>>> [ 3.910953] RIP: 0010:amdgpu_get_bios+0x3aa/0x580 [amdgpu] >>>> [ 3.911314] Code: c0 48 c7 44 24 5c 00 00 00 00 48 c7 84 24 90 00 00 00 >00 >>00 >>>> 00 00 48 89 d9 48 29 f9 83 c1 3c c1 e9 03 f3 48 ab 49 8b 44 24 38 >>>> <48> 8b 40 08 >>>> 48 85 c0 74 24 ba 3c 00 00 00 48 89 de 4c 89 e7 ff d0 >>>> [ 3.912069] RSP: 0018:ffffa27dce28fc50 EFLAGS: 00010212 >>>> [ 3.912502] RAX: 0000000000000000 RBX: ffffa27dce28fcac RCX: >>>> 0000000000000000 >>>> [ 3.912980] RDX: 0000000000000000 RSI: 0000000000000082 RDI: >>>> ffffa27dce28fce8 >>>> [ 3.913467] RBP: 0000000000000000 R08: 0000000000000001 R09: >>>> 000000000000079a >>>> [ 3.913940] R10: 0000000000000000 R11: 0000000000000001 R12: >>>> ffff88d657af0000 >>>> [ 3.914349] R13: ffffffffc0c38120 R14: ffffa27dce28fc68 R15: >>>> ffff88d657af0000 >>>> [ 3.914767] FS: 0000000000000000(0000) GS:ffff88d65f400000(0000) >>>> knlGS:0000000000000000 >>>> [ 3.915203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 3.915637] CR2: 0000000000000008 CR3: 0000003e7540a000 CR4: >>>> 00000000003406e0 >>>> [ 3.916075] Call Trace: >>>> [ 3.916522] ? pcie_capability_clear_and_set_word+0x53/0x80 >>>> [ 3.917014] amdgpu_device_init+0x923/0x1820 [amdgpu] >>>> [ 3.917515] amdgpu_driver_load_kms+0x71/0x310 [amdgpu] >>>> [ 3.917997] drm_dev_register+0x113/0x1a0 [drm] >>>> [ 3.918514] amdgpu_pci_probe+0xb8/0x150 [amdgpu] >>>> [ 3.919003] ? __pm_runtime_resume+0x54/0x70 >>>> [ 3.919270] usb 1-2: New USB device found, idVendor=14dd, >>idProduct=1005, >>>> bcdDevice= 0.00 >>>> [ 3.919498] local_pci_probe+0x3d/0x90 >>>> [ 3.919503] ? __schedule+0x3de/0x690 >>>> [ 3.920374] usb 1-2: New USB device strings: Mfr=1, Product=2, >>>> SerialNumber=3 >>>> [ 3.921137] work_for_cpu_fn+0x10/0x20 >>>> [ 3.922028] usb 1-2: Product: D2CIM-VUSB >>>> [ 3.922815] process_one_work+0x159/0x3e0 >>>> [ 3.923633] usb 1-2: Manufacturer: Raritan >>>> [ 3.923635] usb 1-2: SerialNumber: EFFB212D0A6E32F >>>> [ 3.924416] worker_thread+0x22b/0x440 >>>> [ 3.924419] ? rescuer_thread+0x350/0x350 >>>> [ 3.927812] kthread+0xf6/0x130 >>>> [ 3.928157] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) >>>> [ 3.928365] ? kthread_destroy_worker+0x40/0x40 >>>> [ 3.929401] ata1.00: ATA-10: INTEL SSDSC2KG960G7, SCV10100, max >>>> UDMA/133 >>>> [ 3.930101] ret_from_fork+0x1f/0x30 >>>> [ 3.930103] Modules linked in: amdgpu(+) crct10dif_pclmul crc32_pclmul >>>> ghash_clmulni_intel ast amd_iommu_v2 aesni_intel gpu_sched >>>> i2c_algo_bit >>>> aes_x86_64 ttm crypto_simd drm_kms_helper cryptd glue_helper >>>> syscopyarea sysfillrect ahci sysimgblt libahci fb_sys_fops ixgbe(+) >>>> drm dca mdio >>>> [ 3.930965] ata1.00: 1875385008 sectors, multi 1: LBA48 NCQ (depth 32) >>>> [ 3.931085] ata1.00: configured for UDMA/133 >>>> [ 3.931809] CR2: 0000000000000008 >>>> [ 3.934723] scsi 0:0:0:0: Direct-Access ATA INTEL SSDSC2KG96 0100 >>PQ: >>>> 0 ANSI: 5 >>>> >>>> >>>> Thanks! >>>> >>>> Kent >>>> >>>> -----Original Message----- >>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >>>> Quan, Evan >>>> Sent: Monday, May 27, 2019 9:17 PM >>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx> >>>> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before >>>> baco reset >>>> >>>> Reviewed-by: Evan Quan <evan.quan@xxxxxxx> >>>> >>>>> -----Original Message----- >>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >>>>> Emily Deng >>>>> Sent: Monday, May 27, 2019 3:43 PM >>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx> >>>>> Subject: [PATCH] drm/amdgpu: Need to set the baco cap before baco >>>>> reset >>>>> >>>>> For passthrough, after rebooted the VM, driver will do a baco >>>>> reset before doing other driver initialization during loading driver. >>>>> For doing the baco reset, it will first check the baco reset capability. >>>>> So first need to set the cap from the vbios information or baco >>>>> reset won't >>>> be enabled. >>>>> >>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 >++++++++++-- >>--- >>>>> ------- >>>>> drivers/gpu/drm/amd/amdgpu/soc15.c | 3 ++- >>>>> drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c | 4 ++++ >>>>> .../amd/powerplay/hwmgr/vega10_processpptables.c | 24 >>>>> ++++++++++++++++++++++ >>>>> .../amd/powerplay/hwmgr/vega10_processpptables.h | 1 + >>>>> 5 files changed, 42 insertions(+), 14 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> index b6ded84..7a8c220 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>> @@ -1541,6 +1541,17 @@ static int >>>>> amdgpu_device_ip_early_init(struct >>>>> amdgpu_device *adev) >>>>> if (amdgpu_sriov_vf(adev)) >>>>> adev->pm.pp_feature &= ~PP_GFXOFF_MASK; >>>>> >>>>> + /* Read BIOS */ >>>>> + if (!amdgpu_get_bios(adev)) >>>>> + return -EINVAL; >>>>> + >>>>> + r = amdgpu_atombios_init(adev); >>>>> + if (r) { >>>>> + dev_err(adev->dev, "amdgpu_atombios_init failed\n"); >>>>> + amdgpu_vf_error_put(adev, >>>>> AMDGIM_ERROR_VF_ATOMBIOS_INIT_FAIL, 0, 0); >>>>> + return r; >>>>> + } >>>>> + >>>>> for (i = 0; i < adev->num_ip_blocks; i++) { >>>>> if ((amdgpu_ip_block_mask & (1 << i)) == 0) { >>>>> DRM_ERROR("disabled ip block: %d <%s>\n", @@ - >>>>> 2591,19 +2602,6 @@ int amdgpu_device_init(struct amdgpu_device >>*adev, >>>>> goto fence_driver_init; >>>>> } >>>>> >>>>> - /* Read BIOS */ >>>>> - if (!amdgpu_get_bios(adev)) { >>>>> - r = -EINVAL; >>>>> - goto failed; >>>>> - } >>>>> - >>>>> - r = amdgpu_atombios_init(adev); >>>>> - if (r) { >>>>> - dev_err(adev->dev, "amdgpu_atombios_init failed\n"); >>>>> - amdgpu_vf_error_put(adev, >>>>> AMDGIM_ERROR_VF_ATOMBIOS_INIT_FAIL, 0, 0); >>>>> - goto failed; >>>>> - } >>>>> - >>>>> /* detect if we are with an SRIOV vbios */ >>>>> amdgpu_device_detect_sriov_bios(adev); >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c >>>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c >>>>> index 78bd4fc..d9fdd95 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c >>>>> @@ -764,7 +764,8 @@ static bool soc15_need_reset_on_init(struct >>>>> amdgpu_device *adev) >>>>> /* Just return false for soc15 GPUs. Reset does not seem to >>>>> * be necessary. >>>>> */ >>>>> - return false; >>>>> + if (!amdgpu_passthrough(adev)) >>>>> + return false; >>>>> >>>>> if (adev->flags & AMD_IS_APU) >>>>> return false; >>>>> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c >>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c >>>>> index ce6aeb5..1d9bb29 100644 >>>>> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c >>>>> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c >>>>> @@ -5311,8 +5311,12 @@ static const struct pp_hwmgr_func >>>>> vega10_hwmgr_funcs = { >>>>> >>>>> int vega10_hwmgr_init(struct pp_hwmgr *hwmgr) { >>>>> + struct amdgpu_device *adev = hwmgr->adev; >>>>> + >>>>> hwmgr->hwmgr_func = &vega10_hwmgr_funcs; >>>>> hwmgr->pptable_func = &vega10_pptable_funcs; >>>>> + if (amdgpu_passthrough(adev)) >>>>> + return vega10_baco_set_cap(hwmgr); >>>>> >>>>> return 0; >>>>> } >>>>> diff --git >>>>> a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c >>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c >>>>> index b6767d7..83d22cd 100644 >>>>> --- >a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c >>>>> +++ >>b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c >>>>> @@ -1371,3 +1371,27 @@ int vega10_get_powerplay_table_entry(struct >>>>> pp_hwmgr *hwmgr, >>>>> >>>>> return result; >>>>> } >>>>> + >>>>> +int vega10_baco_set_cap(struct pp_hwmgr *hwmgr) { >>>>> + int result = 0; >>>>> + >>>>> + const ATOM_Vega10_POWERPLAYTABLE *powerplay_table; >>>>> + >>>>> + powerplay_table = get_powerplay_table(hwmgr); >>>>> + >>>>> + PP_ASSERT_WITH_CODE((powerplay_table != NULL), >>>>> + "Missing PowerPlay Table!", return -1); >>>>> + >>>>> + result = check_powerplay_tables(hwmgr, powerplay_table); >>>>> + >>>>> + PP_ASSERT_WITH_CODE((result == 0), >>>>> + "check_powerplay_tables failed", return result); >>>>> + >>>>> + set_hw_cap( >>>>> + hwmgr, >>>>> + 0 != (le32_to_cpu(powerplay_table->ulPlatformCaps) >>>>> & ATOM_VEGA10_PP_PLATFORM_CAP_BACO), >>>>> + PHM_PlatformCaps_BACO); >>>>> + return result; >>>>> +} >>>>> + >>>>> diff --git >>>>> a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h >>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h >>>>> index d83ed2a..da5fbec 100644 >>>>> --- >>a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h >>>>> +++ >>b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h >>>>> @@ -59,4 +59,5 @@ extern int >>>>> vega10_get_number_of_powerplay_table_entries(struct pp_hwmgr >>>> *hwmgr); >>>>> extern int vega10_get_powerplay_table_entry(struct pp_hwmgr >>>>> *hwmgr, uint32_t entry_index, >>>>> struct pp_power_state *power_state, int >>>> (*call_back_func)(struct >>>>> pp_hwmgr *, void *, >>>>> struct pp_power_state *, void *, uint32_t)); >>>>> +extern int vega10_baco_set_cap(struct pp_hwmgr *hwmgr); >>>>> #endif >>>>> -- >>>>> 2.7.4 >>>>> >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx