On 9/2/20 5:56 PM, Bjorn Helgaas wrote:
On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:At this point the ASIC is already post reset by the HW/PSP so the HW not in proper state to be configured for suspension, some blocks might be even gated and so best is to avoid touching it. v2: Rename in_dpc to more meaningful name Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> Reviewed-by: Alex Deucher <alexander.deucher@xxxxxxx> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +++++ drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 +++++ drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 ++++++++------ drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3 +++ 6 files changed, 65 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index c311a3c..b20354f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,7 @@ struct amdgpu_device { atomic_t throttling_logging_enabled; struct ratelimit_state throttling_logging_rs; uint32_t ras_features; + bool in_pci_err_recovery; }; static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 74a1c03..1fbf8a1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg, { uint32_t ret; + if (adev->in_pci_err_recovery) + return 0;I don't know the whole scheme of this, but this looks racy. It looks like the normal path through this function is the readl(), which I assume is an MMIO read from the PCI device. If this is called after a PCI error occurs, but before amdgpu_pci_slot_reset() sets adev->in_pci_err_recovery, the readl() will return 0xffffffff. If this is called after amdgpu_pci_slot_reset() sets adev->in_pci_err_recovery, it will return 0. Do you really want those two different cases?
This is not intended to to avoid hardware accessed by other
threads when doing PCI recovery (answers also Denis's question) -
Andrey
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev)) return amdgpu_kiq_rreg(adev, reg);@@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev) pci_restore_state(pdev); + adev->in_pci_err_recovery = true; r = amdgpu_device_ip_suspend(adev); + adev->in_pci_err_recovery = false; if (r) goto out; |
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx