On 12/09/ , Christian KKKnig wrote: > Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky: > > > > On 2021-12-09 4:00 a.m., Christian König wrote: > > > > > > > > > Am 09.12.21 um 09:49 schrieb Lang Yu: > > > > It is useful to maintain error context when debugging > > > > SW/FW issues. We introduce amdgpu_device_halt() for this > > > > purpose. It will bring hardware to a kind of halt state, > > > > so that no one can touch it any more. > > > > > > > > Compare to a simple hang, the system will keep stable > > > > at least for SSH access. Then it should be trivial to > > > > inspect the hardware state and see what's going on. > > > > > > > > Suggested-by: Christian Koenig <christian.koenig@xxxxxxx> > > > > Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > > > > Signed-off-by: Lang Yu <lang.yu@xxxxxxx> > > > > --- > > > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++ > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 > > > > ++++++++++++++++++++++ > > > > 2 files changed, 41 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > index c5cfe2926ca1..3f5f8f62aa5c 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct > > > > amdgpu_device *adev, > > > > void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev, > > > > struct amdgpu_ring *ring); > > > > +void amdgpu_device_halt(struct amdgpu_device *adev); > > > > + > > > > /* atpx handler */ > > > > #if defined(CONFIG_VGA_SWITCHEROO) > > > > void amdgpu_register_atpx_handler(void); > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > index a1c14466f23d..62216627cc83 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct > > > > amdgpu_device *adev, > > > > amdgpu_asic_invalidate_hdp(adev, ring); > > > > } > > > > + > > > > +/** > > > > + * amdgpu_device_halt() - bring hardware to some kind of halt state > > > > + * > > > > + * @adev: amdgpu_device pointer > > > > + * > > > > + * Bring hardware to some kind of halt state so that no one can > > > > touch it > > > > + * any more. It will help to maintain error context when error > > > > occurred. > > > > + * Compare to a simple hang, the system will keep stable at > > > > least for SSH > > > > + * access. Then it should be trivial to inspect the hardware state and > > > > + * see what's going on. Implemented as following: > > > > + * > > > > + * 1. drm_dev_unplug() makes device inaccessible to user > > > > space(IOCTLs, etc), > > > > + * clears all CPU mappings to device, disallows remappings through > > > > page faults > > > > + * 2. amdgpu_irq_disable_all() disables all interrupts > > > > + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences > > > > + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings > > > > + * 5. pci_disable_device() and pci_wait_for_pending_transaction() > > > > + * flush any in flight DMA operations > > > > + * 6. set adev->no_hw_access to true > > > > + */ > > > > +void amdgpu_device_halt(struct amdgpu_device *adev) > > > > +{ > > > > + struct pci_dev *pdev = adev->pdev; > > > > + struct drm_device *ddev = &adev->ddev; > > > > + > > > > + drm_dev_unplug(ddev); > > > > + > > > > + amdgpu_irq_disable_all(adev); > > > > + > > > > + amdgpu_fence_driver_hw_fini(adev); > > > > + > > > > + amdgpu_device_unmap_mmio(adev); > > > > > > Note that this one will cause page fault on any subsequent MMIO access > > (trough registers or by direct VRAM access) > > > > > > > > > > > > + > > > > + pci_disable_device(pdev); > > > > + pci_wait_for_pending_transaction(pdev); > > > > + > > > > + adev->no_hw_access = true; > > > > > > I think we need to reorder this, e.g. set adev->no_hw_access much > > > earlier for example. Andrey what do you think? > > > > > > Earlier can be ok but at least after the last HW configuration we > > actaully want to do like disabling IRQs. > > My thinking was to at least do this before we unmap the MMIO to avoid the > crash. > > Additionally to that we maybe don't even want to do this for this case. So we just do "adev->no_hw_access = true;" before "amdgpu_device_unmap_mmio(adev);". That can avoid potential registers access page faults. But direct VRAM access will still trigger page faults. For example, "cat /sys/class/drm/card0/device/pp_od_clk_voltage" will call smu_cmn_update_table and can still trigger a page fault. smu_cmn_update_table() { ... if (drv2smu) { memcpy(table->cpu_addr, table_data, table_size); ... } Regards, Lang > Christian. > > > > > > > Andrey > > > > > > > > Apart from that sounds like the right idea to me. > > > > > > Regards, > > > Christian. > > > > > > > +} > > > >