It's based on v5.9-rc2, but it won't apply cleanly because it was applied on top of
a significant number of amd-staging-drm-next patches.
Andrey
From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
Sent: 02 September 2020 17:36
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>; linux-pci@xxxxxxxxxxxxxxx <linux-pci@xxxxxxxxxxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Das, Nirmoy <Nirmoy.Das@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Tuikov, Luben <Luben.Tuikov@xxxxxxx>; bhelgaas@xxxxxxxxxx <bhelgaas@xxxxxxxxxx>
Subject: Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12
On Wed, Sep 02, 2020 at 02:42:02PM -0400, Andrey Grodzovsky wrote:
> Many PCI bus controllers are able to detect a variety of hardware PCI errors on the bus,
> such as parity errors on the data and address buses. A typical action taken is to disconnect
> the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered,
> so that the affected PCI device(s) are reset and put back into working condition.
> In our case the reconnection mechanism is facilitated by the kernel Downstream Port Containment (DPC)
> driver, which will intercept the PCIe error and remove (isolate) the faulting device, after which it
> will call into the PCIe recovery code of the PCI core.
> This code will call hooks which are implemented in this patchset. The error is
> first reported, at which point we block the GPU scheduler. Next, DPC resets the
> PCI link, which generates a HW interrupt that is intercepted by SMU/PSP, which
> start executing a mode1 reset of the ASIC. Next, the slot reset hook is called,
> at which point we wait for the ASIC reset to complete, restore PCI config space and run
> the HW suspend/resume sequence to reinit the ASIC.
> The last hook called is resume normal operation, at which point we restart the GPU scheduler.
>
> More info on PCIe error handling and DPC are here:
> https://nam11.safelinks.protection.outlook.com/?url="">
> https://nam11.safelinks.protection.outlook.com/?url="">
>
> v4: Rebase to 5.9 kernel and revert the PCI error recovery core commit which breaks the feature.
What does this apply to? I tried
- v5.9-rc1 (9123e3a74ec7 ("Linux 5.9-rc1")),
- v5.9-rc2 (d012a7190fc1 ("Linux 5.9-rc2")),
- v5.9-rc3 (f75aef392f86 ("Linux 5.9-rc3")),
- drm-next (3393649977f9 ("Merge tag 'drm-intel-next-2020-08-24-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-next")),
- linux-next (4442749a2031 ("Add linux-next specific files for 20200902"))
but it doesn't apply cleanly to any.
> Andrey Grodzovsky (8):
> drm/amdgpu: Avoid accessing HW when suspending SW state
> drm/amdgpu: Block all job scheduling activity during DPC recovery
> drm/amdgpu: Fix SMU error failure
> drm/amdgpu: Fix consecutive DPC recovery failures.
> drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
> drm/amdgpu: Disable DPC for XGMI for now.
> drm/amdgpu: Minor checkpatch fix
> Revert "PCI/ERR: Update error status after reset_link()"
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 +++++++++++++++++++++--------
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 +
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 ++-
> drivers/gpu/drm/amd/amdgpu/nv.c | 4 +-
> drivers/gpu/drm/amd/amdgpu/soc15.c | 4 +-
> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3 +
> drivers/pci/pcie/err.c | 3 +-
> 10 files changed, 222 insertions(+), 79 deletions(-)
>
> --
> 2.7.4
>
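For readers following the thread, the hook ordering described in the quoted cover letter can be sketched as a small userspace model. This is only an illustration under stated assumptions: it is not code from the patchset, the callback table merely mirrors three members of the kernel's struct pci_error_handlers (error_detected, slot_reset, resume) with simplified signatures, and every other name here (struct gpu_model, the model_* functions, run_recovery) is hypothetical.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_RESULT_NEED_RESET,
	ERS_RESULT_RECOVERED,
	ERS_RESULT_DISCONNECT,
};

/* Hypothetical device state; this is NOT the real struct amdgpu_device. */
struct gpu_model {
	int scheduler_blocked;
	int asic_reset_done;
	int config_restored;
	int hw_initialized;
};

/* Callback table modelled on struct pci_error_handlers (signatures simplified). */
struct error_handlers {
	enum ers_result (*error_detected)(struct gpu_model *gpu);
	enum ers_result (*slot_reset)(struct gpu_model *gpu);
	void (*resume)(struct gpu_model *gpu);
};

/* Step 1: the error is reported, so block the GPU scheduler. */
static enum ers_result model_error_detected(struct gpu_model *gpu)
{
	gpu->scheduler_blocked = 1;
	gpu->hw_initialized = 0;
	return ERS_RESULT_NEED_RESET;
}

/* Step 2: stands in for DPC resetting the link, which in the cover
 * letter's description leads SMU/PSP to run a mode1 ASIC reset. */
static void model_link_reset(struct gpu_model *gpu)
{
	gpu->asic_reset_done = 1;
}

/* Step 3: once the ASIC reset is complete, restore config space and
 * reinitialize the hardware. */
static enum ers_result model_slot_reset(struct gpu_model *gpu)
{
	if (!gpu->asic_reset_done)
		return ERS_RESULT_DISCONNECT;
	gpu->config_restored = 1;
	gpu->hw_initialized = 1;
	return ERS_RESULT_RECOVERED;
}

/* Step 4: resume normal operation by restarting the scheduler. */
static void model_resume(struct gpu_model *gpu)
{
	gpu->scheduler_blocked = 0;
}

const struct error_handlers model_handlers = {
	.error_detected = model_error_detected,
	.slot_reset = model_slot_reset,
	.resume = model_resume,
};

/* Drives the hooks in the order given in the cover letter.
 * Returns 0 on recovery, -1 if any step reports failure. */
int run_recovery(const struct error_handlers *h, struct gpu_model *gpu)
{
	assert(h && gpu);

	if (h->error_detected(gpu) != ERS_RESULT_NEED_RESET)
		return -1;
	model_link_reset(gpu);
	if (h->slot_reset(gpu) != ERS_RESULT_RECOVERED)
		return -1;
	h->resume(gpu);
	return 0;
}
```

Calling run_recovery(&model_handlers, &gpu) on a zero-initialized struct gpu_model returns 0 and leaves the scheduler unblocked. The real kernel callbacks take a struct pci_dev * (and, for error_detected, a channel state), which this model leaves out for brevity.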
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx