[AMD Official Use Only] >-----Original Message----- >From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> >Sent: Thursday, December 2, 2021 12:01 AM >To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Yu, Lang ><Lang.Yu@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd- >gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option > > >On 2021-12-01 8:11 a.m., Christian König wrote: >> Adding Andrey as well. >> >> Am 01.12.21 um 12:37 schrieb Yu, Lang: >>> [SNIP] >>>>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res); >>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we >>>>>>>> prevent further data corruption with that. >>>>>>>> >>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't >>>>>>>> fully judge here since I don't know the SMU code well enough. >>>>>>> This is what SMU FW guys want. They want "user-visible >>>>>>> (potentially >>>>>>> fatal) >>>>>> errors", then a hang. >>>>>>> They want to keep system state since the error occurred. >>>>>> Well that is rather problematic. >>>>>> >>>>>> First of all we need to really justify that, crashing the kernel >>>>>> is not something easily done. >>>>>> >>>>>> Then this isn't really effective here. What happens is that you >>>>>> crash the kernel thread of the currently executing process, but it >>>>>> is perfectly possible that another thread still tries to send >>>>>> messages to the SMU. You need to have the BUG_ON() before dropping >>>>>> the lock to make sure that this really gets the driver stuck in >>>>>> the current state. >>>>> Thanks. I got it. I just thought it is a kenel panic. >>>>> Could we use a panic() here? >>>> Potentially, but that might reboot the system automatically which is >>>> probably not what you want either. >>>> >>>> How does the SMU firmware team gather the necessary information when >>>> a problem occurs? >>> As far as I know, they usually use a HDT to collect information. >>> And they request a hang when error occurred in ticket. >>> "Suggested error responses include pop-up windows (by x86 driver, if >>> this is possible) or simply hanging after logging the error." >> >> In that case I suggest to set the "don't_touch_the_hardware_any_more" >> procedure we also use in case of PCIe hotplug. >> >> Andrey has the details but essentially it stops the driver from >> touching the hardware any more, signals all fences and unblocks >> everything. >> >> It should then be trivial to inspect the hardware state and see what's >> going on, but the system will keep stable at least for SSH access. >> >> Might be a good idea to have that mode for other fault cases like page >> faults and hardware crashes. >> >> Regards, >> Christian. > > >There is no one specific function that does all of that, what I think can be done is >to bring the device to kind of halt state where no one touches it - as following - > >1) Follow amdpgu_pci_remove - > > drm_dev_unplug to make device inaccessible to user space (IOCTLs >e.t.c.) and clears MMIO mappings to device and disallows remappings through >page faults > > No need to call all of amdgpu_driver_unload_kms but, within it call >amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini toi disable interrupts >and force signall all HW fences. > > pci_disable_device and pci_wait_for_pending_transaction to flush any in flight >DMA operations from device > >2) set adev->no_hw_access so that most of places we access HW (all subsequent >registers reads/writes and SMU/PSP message sending is skipped, but some race >will be with those already in progress so maybe adding some wait) > >Andrey Thanks for Christian's advice and Andrey's clarifications about that. It seems that we should also handle kfd related stuff. Regards, Lang > > >> >>> >>> Regards, >>> Lang >>> >>