Re: [RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler

Nirmoy <nirmodas@xxxxxxx> · Fri, 14 Aug 2020 17:23:39 +0200

On 8/13/20 11:17 PM, Luben Tuikov wrote:
I support having AER handling.

However, I feel it should be offloaded to the DRM layer.
The PCI driver gets the AER callback and immediately
offloads into DRM, as "return drm_aer_recover(dev); }".
The DRM layer does a top-down approach into the error
recovery procedure.

The PCI device driver provides (would/should?) a capability
(at registration) which the DRM layer would inspect and
subsequently call into the PCI driver's low-level methods
to recover the error or to reset the device.

But it should be a top-down approach. I believe the thread
below has somehow hinted at this.

The implementation below boils down to:

	If recoverable error, all is good.
	If unrecoverable error, then
		disable power management,
		suspend I/O operations,
		cancel pending work,
		call into PCI driver to clear
			any state it keeps,
		call into PCI driver to reset display control.
		Etc.

And this regime could be performed by DRM.

And as you can see now, the function implemented below,
*calls into* (that's the key here!) DRM, and it should be
the other way around--the DRM should call into the PCI driver,
after the PCI driver's callback immediately calls into DRM,
as outlined above.

This abstraction could be expanded to more concepts of PCIe GPU drivers,
and it would scale well, beyond PCIe as a protocol for graphics.

If drm handles pci error callbacks then it should also handle power 
management

because power management also calls into drm in very similar way. I 
think these are very

low device level tasks for drm.


+static const struct pci_error_handlers amdgpu_err_handler = {
That's too generic a name for this. I'd rather add "pci" in there,

static const struct pci_error_handlers amdgpu_pci_err_handler =  {


True, thanks for the name suggestion.


Nirmoy


	.element = init,
	...
};

Being a singular noun from the outset is good and this is preserved.

+       .error_detected = amdgpu_pci_err_detected,
+};
+
+
  static struct pci_driver amdgpu_kms_pci_driver = {
  	.name = DRIVER_NAME,
  	.id_table = pciidlist,
@@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = {
  	.remove = amdgpu_pci_remove,
  	.shutdown = amdgpu_pci_shutdown,
  	.driver.pm = &amdgpu_pm_ops,
+	.err_handler = &amdgpu_err_handler,
".err_handler = amdgpu_pci_err_handler,"


Regards,
Luben

On 2020-08-13 2:18 p.m., Andrey Grodzovsky wrote:
On 8/13/20 11:06 AM, Nirmoy wrote:
On 8/13/20 3:38 PM, Andrey Grodzovsky wrote:
On 8/13/20 7:09 AM, Nirmoy wrote:
On 8/12/20 4:52 PM, Andrey Grodzovsky wrote:
On 8/11/20 9:30 AM, Nirmoy Das wrote:
This patch will ignore non-fatal errors and try to
stop amdgpu's sw stack on fatal errors.

Signed-off-by: Nirmoy Das <nirmoy.das@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 ++++++++++++++++++++++++-
   1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index c1219af2e7d6..2b9ede3000ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -35,6 +35,7 @@
   #include <linux/pm_runtime.h>
   #include <linux/vga_switcheroo.h>
   #include <drm/drm_probe_helper.h>
+#include <drm/drm_atomic_helper.h>
   #include <linux/mmu_notifier.h>
     #include "amdgpu.h"
@@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
       .patchlevel = KMS_DRIVER_PATCHLEVEL,
   };
   +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,
+                        pci_channel_state_t state)
+{
+    struct drm_device *dev = pci_get_drvdata(pdev);
+    struct amdgpu_device *adev = dev->dev_private;
+    int i;
+    int ret = PCI_ERS_RESULT_DISCONNECT;
+
+    switch (state) {
+    case pci_channel_io_normal:
+        ret = PCI_ERS_RESULT_CAN_RECOVER;
+        break;
+    default:
+        /* Disable power management */
+        adev->runpm = 0;
+        /* Suspend all IO operations */
+        amdgpu_fbdev_set_suspend(adev, 1);
+ cancel_delayed_work_sync(&adev->delayed_init_work);
+        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+            struct amdgpu_ring *ring = adev->rings[i];
+
+            if (!ring || !ring->sched.thread)
+                continue;
+
+ amdgpu_job_stop_all_jobs_on_sched(&ring->sched);

You need to call drm_sched_stop first before calling this

+        }
+
+        if (adev->mode_info.mode_config_initialized) {
+            if (!amdgpu_device_has_dc_support(adev))
+ drm_helper_force_disable_all(adev->ddev);
+            else
+ drm_atomic_helper_shutdown(adev->ddev);
+        }
+
+        amdgpu_fence_driver_fini(adev);
+        amdgpu_fbdev_fini(adev);
+        /* Try to close drm device to stop applications
+         * from opening dri files for further IO operations.
+         * TODO: This will throw warning as ttm is not
+         * cleaned perperly */
+        drm_dev_fini(dev);

I think user mode applications might still hold reference to the drm device
through through drm_dev_get either by directly opening
the device file or indirectly through importing DMA buff, if so when the
last of them terminate drm_dev_put->drm_dev_release->...->drm_dev_fini
might get called again causing use after free e.t.c issues. Maybe better to
call here drm_dev_put then and so drm_dev_fini will get called when this
last user client releases his reference.

drm_dev_fini() seems to be cleaner. Problem is  window manager(sway) never
gets terminated after the AER error and drm files remains active. Simple cat
on dri files

goes though amdgpu and spits out more errors.

What happens if you kill the window manager after you closed drm device with
your original code applied ? I would expect drm_dev_fini to be called again
for the reason i explained above and this would obviously would be wrong to
happen.
Hi Andrey,


hmm I quickly tried that, Kernel crashed and later rebooted after sometime. I
don't have a serial console to check logs and there was no logs afterwards in
journalctl.

drm_dev_put() had similar behavior, kernel/machine was inaccessible over ssh.


Did you face same behavior while testing gpu hotplug ?


Nirmoy

Yea, in my case device sysfs structure was removed on pci_remove while when last
user client dropped reference and this led
to drm_dev_fini to be called there were more sysfs entries removal there which
lead to a crash. But here i don't think the sysfs for drm_device
is removed because the device is not extracted...

Andrey



Andrey



Also a general question - in my work on DPC recovery feature which tries to
recover after PCIe error - once the PCI error has happened MMIO registers
become
unaccessible for r/w as the PCI link is dead until after the PCI link is
reset by the DPC driver (see
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2FPCI%2Fpci-error-recovery.html&amp;data=02%7C01%7Cluben.tuikov%40amd.com%7C4490887b817b47b029a808d83fb5525f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637329395327201326&amp;sdata=ukcNKOfdzIo7CdKr0yq78vLgqKkzHtqa%2FangK7OEYsA%3D&amp;reserved=0 section
6.1.4).
Your case is to try and gracefully to close the drm device once fatal error
happened, didn't you encounter errors or warnings when accessing HW
registers during any of the operations
above ?

As discussed over chat, it seems aer generated with aer-inject tool just
triggers kernel PCI error APIs but the device is still active so I didn't
encounter any errors when accessing HW registers.


Nirmoy


Andrey


+        break;
+    }
+
+    return ret;
+}
+
+static const struct pci_error_handlers amdgpu_err_handler = {
+       .error_detected = amdgpu_pci_err_detected,
+};
+
+
   static struct pci_driver amdgpu_kms_pci_driver = {
       .name = DRIVER_NAME,
       .id_table = pciidlist,
@@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = {
       .remove = amdgpu_pci_remove,
       .shutdown = amdgpu_pci_shutdown,
       .driver.pm = &amdgpu_pm_ops,
+    .err_handler = &amdgpu_err_handler,
   };
   -
-
   static int __init amdgpu_init(void)
   {
       int r;
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cluben.tuikov%40amd.com%7C4490887b817b47b029a808d83fb5525f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637329395327201326&amp;sdata=hUiqeBUx%2BhXqgL21ewNbPBP0WQp1i66av5CDeg9CF38%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx