On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote: > At this point the ASIC is already post reset by the HW/PSP > so the HW not in proper state to be configured for suspension, > some blocks might be even gated and so best is to avoid touching it. > > v2: Rename in_dpc to more meaningful name > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > Reviewed-by: Alex Deucher <alexander.deucher@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 +++++ > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 ++++++++------ > drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3 +++ > 6 files changed, 65 insertions(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index c311a3c..b20354f 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -992,6 +992,7 @@ struct amdgpu_device { > atomic_t throttling_logging_enabled; > struct ratelimit_state throttling_logging_rs; > uint32_t ras_features; > + bool in_pci_err_recovery; > }; > > static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 74a1c03..1fbf8a1 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg, > { > uint32_t ret; > > + if (adev->in_pci_err_recovery) > + return 0; I don't know the whole scheme of this, but this looks racy. It looks like the normal path through this function is the readl(), which I assume is an MMIO read from the PCI device. If this is called after a PCI error occurs, but before amdgpu_pci_slot_reset() sets adev->in_pci_err_recovery, the readl() will return 0xffffffff. If this is called after amdgpu_pci_slot_reset() sets adev->in_pci_err_recovery, it will return 0. Do you really want those two different cases? > if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev)) > return amdgpu_kiq_rreg(adev, reg); > @@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev) > > pci_restore_state(pdev); > > + adev->in_pci_err_recovery = true; > r = amdgpu_device_ip_suspend(adev); > + adev->in_pci_err_recovery = false; > if (r) > goto out;