Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

Lang Yu <Lang.Yu@xxxxxxx> · Fri, 10 Dec 2021 11:47:00 +0800

On 12/09/ , Christian KKKnig wrote:
> Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
> > 
> > On 2021-12-09 4:00 a.m., Christian König wrote:
> > > 
> > > 
> > > Am 09.12.21 um 09:49 schrieb Lang Yu:
> > > > It is useful to maintain error context when debugging
> > > > SW/FW issues. We introduce amdgpu_device_halt() for this
> > > > purpose. It will bring hardware to a kind of halt state,
> > > > so that no one can touch it any more.
> > > > 
> > > > Compare to a simple hang, the system will keep stable
> > > > at least for SSH access. Then it should be trivial to
> > > > inspect the hardware state and see what's going on.
> > > > 
> > > > Suggested-by: Christian Koenig <christian.koenig@xxxxxxx>
> > > > Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> > > > Signed-off-by: Lang Yu <lang.yu@xxxxxxx>
> > > > ---
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39
> > > > ++++++++++++++++++++++
> > > >   2 files changed, 41 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index c5cfe2926ca1..3f5f8f62aa5c 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct
> > > > amdgpu_device *adev,
> > > >   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
> > > >           struct amdgpu_ring *ring);
> > > >   +void amdgpu_device_halt(struct amdgpu_device *adev);
> > > > +
> > > >   /* atpx handler */
> > > >   #if defined(CONFIG_VGA_SWITCHEROO)
> > > >   void amdgpu_register_atpx_handler(void);
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index a1c14466f23d..62216627cc83 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct
> > > > amdgpu_device *adev,
> > > >         amdgpu_asic_invalidate_hdp(adev, ring);
> > > >   }
> > > > +
> > > > +/**
> > > > + * amdgpu_device_halt() - bring hardware to some kind of halt state
> > > > + *
> > > > + * @adev: amdgpu_device pointer
> > > > + *
> > > > + * Bring hardware to some kind of halt state so that no one can
> > > > touch it
> > > > + * any more. It will help to maintain error context when error
> > > > occurred.
> > > > + * Compare to a simple hang, the system will keep stable at
> > > > least for SSH
> > > > + * access. Then it should be trivial to inspect the hardware state and
> > > > + * see what's going on. Implemented as following:
> > > > + *
> > > > + * 1. drm_dev_unplug() makes device inaccessible to user
> > > > space(IOCTLs, etc),
> > > > + *    clears all CPU mappings to device, disallows remappings through
> > > > page faults
> > > > + * 2. amdgpu_irq_disable_all() disables all interrupts
> > > > + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
> > > > + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
> > > > + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
> > > > + *    flush any in flight DMA operations
> > > > + * 6. set adev->no_hw_access to true
> > > > + */
> > > > +void amdgpu_device_halt(struct amdgpu_device *adev)
> > > > +{
> > > > +    struct pci_dev *pdev = adev->pdev;
> > > > +    struct drm_device *ddev = &adev->ddev;
> > > > +
> > > > +    drm_dev_unplug(ddev);
> > > > +
> > > > +    amdgpu_irq_disable_all(adev);
> > > > +
> > > > +    amdgpu_fence_driver_hw_fini(adev);
> > > > +
> > > > +    amdgpu_device_unmap_mmio(adev);
> > 
> > 
> > Note that this one will cause page fault on any subsequent MMIO access
> > (trough registers or by direct VRAM access)
> > 
> > 
> > > > 
> > > > +
> > > > +    pci_disable_device(pdev);
> > > > +    pci_wait_for_pending_transaction(pdev);
> > > > +
> > > > +    adev->no_hw_access = true;
> > > 
> > > I think we need to reorder this, e.g. set adev->no_hw_access much
> > > earlier for example. Andrey what do you think?
> > 
> > 
> > Earlier can be ok but at least after the last HW configuration we
> > actaully want to do like disabling IRQs.
> 
> My thinking was to at least do this before we unmap the MMIO to avoid the
> crash.
> 
> Additionally to that we maybe don't even want to do this for this case.

So we just do "adev->no_hw_access = true;" before
"amdgpu_device_unmap_mmio(adev);".

That can avoid potential registers access page faults.
But direct VRAM access will still trigger page faults.

For example, 
"cat /sys/class/drm/card0/device/pp_od_clk_voltage"
will call smu_cmn_update_table and can still trigger 
a page fault.

smu_cmn_update_table()
{
 ...
	if (drv2smu) {
		memcpy(table->cpu_addr, table_data, table_size);
 ...
}

Regards,
Lang

> Christian.
> 
> > 
> > 
> > Andrey
> > 
> > > 
> > > Apart from that sounds like the right idea to me.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > +}
> > > 
>