RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco reset

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Philip,
     Sorry for introduce this issue for you. From the code, I couldn't see any issue. And I have tested the code in my Vega10, it is OK.  So I think this is the kfd specific issue, but I couldn't reproduce issue on my platform. Could you create an ticket, and assign to me, and share me your platform, so I could debug it and fix it today.

Best wishes
Emily Deng

>-----Original Message-----
>From: Yang, Philip <Philip.Yang@xxxxxxx>
>Sent: Friday, June 14, 2019 10:16 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; Russell, Kent
><Kent.Russell@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>; amd-
>gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Need to set the baco cap before baco reset
>
>Hi Emily,
>
>I am not familiar with vbios and driver init part, just based on my experience,
>the patch don't modify amdgpu_get_bios but move amdgpu_get_bios to
>amdgpu_device_ip_early_init from amdgpu_device_init, so amdgpu_get_bios
>is executed earlier. The kernel error message "BUG:
>kernel NULL pointer dereference" means something is not initialized.
>Please review the change. This issue blocks rocm release now.
>
>Regards,
>Philip
>
>On 2019-06-13 11:19 p.m., Deng, Emily wrote:
>> Hi Russell,
>>       This patch will read vbios, and parse vbios to get the baco reset feature
>bit.  From the call trace, it shows error in " amdgpu_get_bios ", but this patch
>don't modify amdgpu_get_bios, and code before amdgpu_get_bios. Please
>first check why it will has error when read vbios.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Russell, Kent <Kent.Russell@xxxxxxx>
>>> Sent: Thursday, June 13, 2019 7:11 PM
>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; Deng, Emily
><Emily.Deng@xxxxxxx>;
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx>
>>> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco
>>> reset
>>>
>>> Hi Emily,
>>>
>>> This patch caused a regression on MI25 (Chip 6860, VBIOS
>>> 113-D0513700-001) machines where the driver would not boot. Note that
>>> this was not seen on
>>> Vega10 Frontier (Chip 6863, VBIOS 113-D0501100-X09) or Radeon64 (Chip
>>> 697f). Reverting this patch resolved the issue with no other work
>>> required and was confirmed on all 3 machines.
>>>
>>> Here is the dmesg:
>>>
>>> [    3.908653] amdgpu 0000:23:00.0: BAR 6: can't assign [??? 0x00000000
>flags
>>> 0x20000000] (bogus alignment)
>>> [    3.908692] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000008
>>> [    3.908716] #PF: supervisor read access in kernel mode
>>> [    3.908734] #PF: error_code(0x0000) - not-present page
>>> [    3.908753] PGD 0 P4D 0
>>> [    3.908767] Oops: 0000 [#1] SMP NOPTI
>>> [    3.909293] CPU: 8 PID: 409 Comm: kworker/8:1 Not tainted 5.2.0-rc1-kfd-
>>> compute-roc-master-10734 #1
>>> [    3.909753] Hardware name: Inventec     P47
>WC2071019001
>>> /P47             , BIOS 0.64 04/09/2018
>>> [    3.910534] Workqueue: events work_for_cpu_fn
>>> [    3.910953] RIP: 0010:amdgpu_get_bios+0x3aa/0x580 [amdgpu]
>>> [    3.911314] Code: c0 48 c7 44 24 5c 00 00 00 00 48 c7 84 24 90 00 00 00 00
>00
>>> 00 00 48 89 d9 48 29 f9 83 c1 3c c1 e9 03 f3 48 ab 49 8b 44 24 38
>>> <48> 8b 40 08
>>> 48 85 c0 74 24 ba 3c 00 00 00 48 89 de 4c 89 e7 ff d0
>>> [    3.912069] RSP: 0018:ffffa27dce28fc50 EFLAGS: 00010212
>>> [    3.912502] RAX: 0000000000000000 RBX: ffffa27dce28fcac RCX:
>>> 0000000000000000
>>> [    3.912980] RDX: 0000000000000000 RSI: 0000000000000082 RDI:
>>> ffffa27dce28fce8
>>> [    3.913467] RBP: 0000000000000000 R08: 0000000000000001 R09:
>>> 000000000000079a
>>> [    3.913940] R10: 0000000000000000 R11: 0000000000000001 R12:
>>> ffff88d657af0000
>>> [    3.914349] R13: ffffffffc0c38120 R14: ffffa27dce28fc68 R15:
>>> ffff88d657af0000
>>> [    3.914767] FS:  0000000000000000(0000) GS:ffff88d65f400000(0000)
>>> knlGS:0000000000000000
>>> [    3.915203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    3.915637] CR2: 0000000000000008 CR3: 0000003e7540a000 CR4:
>>> 00000000003406e0
>>> [    3.916075] Call Trace:
>>> [    3.916522]  ? pcie_capability_clear_and_set_word+0x53/0x80
>>> [    3.917014]  amdgpu_device_init+0x923/0x1820 [amdgpu]
>>> [    3.917515]  amdgpu_driver_load_kms+0x71/0x310 [amdgpu]
>>> [    3.917997]  drm_dev_register+0x113/0x1a0 [drm]
>>> [    3.918514]  amdgpu_pci_probe+0xb8/0x150 [amdgpu]
>>> [    3.919003]  ? __pm_runtime_resume+0x54/0x70
>>> [    3.919270] usb 1-2: New USB device found, idVendor=14dd,
>idProduct=1005,
>>> bcdDevice= 0.00
>>> [    3.919498]  local_pci_probe+0x3d/0x90
>>> [    3.919503]  ? __schedule+0x3de/0x690
>>> [    3.920374] usb 1-2: New USB device strings: Mfr=1, Product=2,
>>> SerialNumber=3
>>> [    3.921137]  work_for_cpu_fn+0x10/0x20
>>> [    3.922028] usb 1-2: Product: D2CIM-VUSB
>>> [    3.922815]  process_one_work+0x159/0x3e0
>>> [    3.923633] usb 1-2: Manufacturer: Raritan
>>> [    3.923635] usb 1-2: SerialNumber: EFFB212D0A6E32F
>>> [    3.924416]  worker_thread+0x22b/0x440
>>> [    3.924419]  ? rescuer_thread+0x350/0x350
>>> [    3.927812]  kthread+0xf6/0x130
>>> [    3.928157] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>>> [    3.928365]  ? kthread_destroy_worker+0x40/0x40
>>> [    3.929401] ata1.00: ATA-10: INTEL SSDSC2KG960G7, SCV10100, max
>>> UDMA/133
>>> [    3.930101]  ret_from_fork+0x1f/0x30
>>> [    3.930103] Modules linked in: amdgpu(+) crct10dif_pclmul crc32_pclmul
>>> ghash_clmulni_intel ast amd_iommu_v2 aesni_intel gpu_sched
>>> i2c_algo_bit
>>> aes_x86_64 ttm crypto_simd drm_kms_helper cryptd glue_helper
>>> syscopyarea sysfillrect ahci sysimgblt libahci fb_sys_fops ixgbe(+)
>>> drm dca mdio
>>> [    3.930965] ata1.00: 1875385008 sectors, multi 1: LBA48 NCQ (depth 32)
>>> [    3.931085] ata1.00: configured for UDMA/133
>>> [    3.931809] CR2: 0000000000000008
>>> [    3.934723] scsi 0:0:0:0: Direct-Access     ATA      INTEL SSDSC2KG96 0100
>PQ:
>>> 0 ANSI: 5
>>>
>>>
>>> Thanks!
>>>
>>> Kent
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>> Quan, Evan
>>> Sent: Monday, May 27, 2019 9:17 PM
>>> To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx>
>>> Subject: RE: [PATCH] drm/amdgpu: Need to set the baco cap before baco
>>> reset
>>>
>>> Reviewed-by: Evan Quan <evan.quan@xxxxxxx>
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>>> Emily Deng
>>>> Sent: Monday, May 27, 2019 3:43 PM
>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx>
>>>> Subject: [PATCH] drm/amdgpu: Need to set the baco cap before baco
>>>> reset
>>>>
>>>> For passthrough, after rebooted the VM, driver will do a baco reset
>>>> before doing other driver initialization during loading  driver. For
>>>> doing the baco reset, it will first check the baco reset capability.
>>>> So first need to set the cap from the vbios information or baco
>>>> reset won't
>>> be enabled.
>>>>
>>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c         | 24 ++++++++++--
>---
>>>> -------
>>>>   drivers/gpu/drm/amd/amdgpu/soc15.c                 |  3 ++-
>>>>   drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c |  4 ++++
>>>>   .../amd/powerplay/hwmgr/vega10_processpptables.c   | 24
>>>> ++++++++++++++++++++++
>>>>   .../amd/powerplay/hwmgr/vega10_processpptables.h   |  1 +
>>>>   5 files changed, 42 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index b6ded84..7a8c220 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -1541,6 +1541,17 @@ static int amdgpu_device_ip_early_init(struct
>>>> amdgpu_device *adev)
>>>>   	if (amdgpu_sriov_vf(adev))
>>>>   		adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
>>>>
>>>> +	/* Read BIOS */
>>>> +	if (!amdgpu_get_bios(adev))
>>>> +		return -EINVAL;
>>>> +
>>>> +	r = amdgpu_atombios_init(adev);
>>>> +	if (r) {
>>>> +		dev_err(adev->dev, "amdgpu_atombios_init failed\n");
>>>> +		amdgpu_vf_error_put(adev,
>>>> AMDGIM_ERROR_VF_ATOMBIOS_INIT_FAIL, 0, 0);
>>>> +		return r;
>>>> +	}
>>>> +
>>>>   	for (i = 0; i < adev->num_ip_blocks; i++) {
>>>>   		if ((amdgpu_ip_block_mask & (1 << i)) == 0) {
>>>>   			DRM_ERROR("disabled ip block: %d <%s>\n", @@ -
>>>> 2591,19 +2602,6 @@ int amdgpu_device_init(struct amdgpu_device
>*adev,
>>>>   		goto fence_driver_init;
>>>>   	}
>>>>
>>>> -	/* Read BIOS */
>>>> -	if (!amdgpu_get_bios(adev)) {
>>>> -		r = -EINVAL;
>>>> -		goto failed;
>>>> -	}
>>>> -
>>>> -	r = amdgpu_atombios_init(adev);
>>>> -	if (r) {
>>>> -		dev_err(adev->dev, "amdgpu_atombios_init failed\n");
>>>> -		amdgpu_vf_error_put(adev,
>>>> AMDGIM_ERROR_VF_ATOMBIOS_INIT_FAIL, 0, 0);
>>>> -		goto failed;
>>>> -	}
>>>> -
>>>>   	/* detect if we are with an SRIOV vbios */
>>>>   	amdgpu_device_detect_sriov_bios(adev);
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> index 78bd4fc..d9fdd95 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>>> @@ -764,7 +764,8 @@ static bool soc15_need_reset_on_init(struct
>>>> amdgpu_device *adev)
>>>>   	/* Just return false for soc15 GPUs.  Reset does not seem to
>>>>   	 * be necessary.
>>>>   	 */
>>>> -	return false;
>>>> +	if (!amdgpu_passthrough(adev))
>>>> +		return false;
>>>>
>>>>   	if (adev->flags & AMD_IS_APU)
>>>>   		return false;
>>>> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
>>>> index ce6aeb5..1d9bb29 100644
>>>> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
>>>> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
>>>> @@ -5311,8 +5311,12 @@ static const struct pp_hwmgr_func
>>>> vega10_hwmgr_funcs = {
>>>>
>>>>   int vega10_hwmgr_init(struct pp_hwmgr *hwmgr)  {
>>>> +	struct amdgpu_device *adev = hwmgr->adev;
>>>> +
>>>>   	hwmgr->hwmgr_func = &vega10_hwmgr_funcs;
>>>>   	hwmgr->pptable_func = &vega10_pptable_funcs;
>>>> +	if (amdgpu_passthrough(adev))
>>>> +		return vega10_baco_set_cap(hwmgr);
>>>>
>>>>   	return 0;
>>>>   }
>>>> diff --git
>>>> a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
>>>> index b6767d7..83d22cd 100644
>>>> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
>>>> +++
>b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.c
>>>> @@ -1371,3 +1371,27 @@ int vega10_get_powerplay_table_entry(struct
>>>> pp_hwmgr *hwmgr,
>>>>
>>>>   	return result;
>>>>   }
>>>> +
>>>> +int vega10_baco_set_cap(struct pp_hwmgr *hwmgr) {
>>>> +	int result = 0;
>>>> +
>>>> +	const ATOM_Vega10_POWERPLAYTABLE *powerplay_table;
>>>> +
>>>> +	powerplay_table = get_powerplay_table(hwmgr);
>>>> +
>>>> +	PP_ASSERT_WITH_CODE((powerplay_table != NULL),
>>>> +		"Missing PowerPlay Table!", return -1);
>>>> +
>>>> +	result = check_powerplay_tables(hwmgr, powerplay_table);
>>>> +
>>>> +	PP_ASSERT_WITH_CODE((result == 0),
>>>> +			    "check_powerplay_tables failed", return result);
>>>> +
>>>> +	set_hw_cap(
>>>> +			hwmgr,
>>>> +			0 != (le32_to_cpu(powerplay_table->ulPlatformCaps)
>>>> & ATOM_VEGA10_PP_PLATFORM_CAP_BACO),
>>>> +			PHM_PlatformCaps_BACO);
>>>> +	return result;
>>>> +}
>>>> +
>>>> diff --git
>>>> a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
>>>> b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
>>>> index d83ed2a..da5fbec 100644
>>>> ---
>a/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
>>>> +++
>b/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_processpptables.h
>>>> @@ -59,4 +59,5 @@ extern int
>>>> vega10_get_number_of_powerplay_table_entries(struct pp_hwmgr
>>> *hwmgr);
>>>> extern int vega10_get_powerplay_table_entry(struct pp_hwmgr *hwmgr,
>>>> uint32_t entry_index,
>>>>   		struct pp_power_state *power_state, int
>>> (*call_back_func)(struct
>>>> pp_hwmgr *, void *,
>>>>   				struct pp_power_state *, void *, uint32_t));
>>>> +extern int vega10_baco_set_cap(struct pp_hwmgr *hwmgr);
>>>>   #endif
>>>> --
>>>> 2.7.4
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux