RE: [PATCH] drm/amdgpu: add support to SMU debug option

"Yu, Lang" <Lang.Yu@xxxxxxx> · Thu, 2 Dec 2021 09:31:38 +0000

[AMD Official Use Only]

>-----Original Message-----
>From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
>Sent: Thursday, December 2, 2021 12:01 AM
>To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Yu, Lang
><Lang.Yu@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-
>gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>
>On 2021-12-01 8:11 a.m., Christian König wrote:
>> Adding Andrey as well.
>>
>> Am 01.12.21 um 12:37 schrieb Yu, Lang:
>>> [SNIP]
>>>>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>>> prevent further data corruption with that.
>>>>>>>>
>>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't
>>>>>>>> fully judge here since I don't know the SMU code well enough.
>>>>>>> This is what SMU FW guys want. They want "user-visible
>>>>>>> (potentially
>>>>>>> fatal)
>>>>>> errors", then a hang.
>>>>>>> They want to keep system state since the error occurred.
>>>>>> Well that is rather problematic.
>>>>>>
>>>>>> First of all we need to really justify that, crashing the kernel
>>>>>> is not something easily done.
>>>>>>
>>>>>> Then this isn't really effective here. What happens is that you
>>>>>> crash the kernel thread of the currently executing process, but it
>>>>>> is perfectly possible that another thread still tries to send
>>>>>> messages to the SMU. You need to have the BUG_ON() before dropping
>>>>>> the lock to make sure that this really gets the driver stuck in
>>>>>> the current state.
>>>>> Thanks. I got it. I just thought it is a kenel panic.
>>>>> Could we use a panic() here?
>>>> Potentially, but that might reboot the system automatically which is
>>>> probably not what you want either.
>>>>
>>>> How does the SMU firmware team gather the necessary information when
>>>> a problem occurs?
>>> As far as I know, they usually use a HDT to collect information.
>>> And they request a hang when error occurred in ticket.
>>> "Suggested error responses include pop-up windows (by x86 driver, if
>>> this is possible) or simply hanging after logging the error."
>>
>> In that case I suggest to set the "don't_touch_the_hardware_any_more"
>> procedure we also use in case of PCIe hotplug.
>>
>> Andrey has the details but essentially it stops the driver from
>> touching the hardware any more, signals all fences and unblocks
>> everything.
>>
>> It should then be trivial to inspect the hardware state and see what's
>> going on, but the system will keep stable at least for SSH access.
>>
>> Might be a good idea to have that mode for other fault cases like page
>> faults and hardware crashes.
>>
>> Regards,
>> Christian.
>
>
>There is no one specific function that does all of that, what I think can be done is
>to bring the device to kind of halt state where no one touches it - as following -
>
>1) Follow amdpgu_pci_remove -
>
>     drm_dev_unplug to make device inaccessible to user space (IOCTLs
>e.t.c.) and clears MMIO mappings to device and disallows remappings through
>page faults
>
>     No need to call all of amdgpu_driver_unload_kms but, within it call
>amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini toi disable interrupts
>and force signall all HW fences.
>
>     pci_disable_device and pci_wait_for_pending_transaction to flush any in flight
>DMA operations from device
>
>2) set adev->no_hw_access so that most of places we access HW (all subsequent
>registers reads/writes and SMU/PSP message sending is skipped, but some race
>will be with those already in progress so maybe adding some wait)
>
>Andrey

Thanks for Christian's advice and Andrey's clarifications about that.
It seems that we should also handle kfd related stuff.

Regards,
Lang

>
>
>>
>>>
>>> Regards,
>>> Lang
>>>
>>