Re: [RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler

Nirmoy <nirmodas@xxxxxxx> · Thu, 13 Aug 2020 13:09:19 +0200

On 8/12/20 4:52 PM, Andrey Grodzovsky wrote:

On 8/11/20 9:30 AM, Nirmoy Das wrote:
This patch will ignore non-fatal errors and try to
stop amdgpu's sw stack on fatal errors.

Signed-off-by: Nirmoy Das <nirmoy.das@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 ++++++++++++++++++++++++-
  1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index c1219af2e7d6..2b9ede3000ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -35,6 +35,7 @@
  #include <linux/pm_runtime.h>
  #include <linux/vga_switcheroo.h>
  #include <drm/drm_probe_helper.h>
+#include <drm/drm_atomic_helper.h>
  #include <linux/mmu_notifier.h>
    #include "amdgpu.h"
@@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
      .patchlevel = KMS_DRIVER_PATCHLEVEL,
  };
  +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,
+                        pci_channel_state_t state)
+{
+    struct drm_device *dev = pci_get_drvdata(pdev);
+    struct amdgpu_device *adev = dev->dev_private;
+    int i;
+    int ret = PCI_ERS_RESULT_DISCONNECT;
+
+    switch (state) {
+    case pci_channel_io_normal:
+        ret = PCI_ERS_RESULT_CAN_RECOVER;
+        break;
+    default:
+        /* Disable power management */
+        adev->runpm = 0;
+        /* Suspend all IO operations */
+        amdgpu_fbdev_set_suspend(adev, 1);
+ cancel_delayed_work_sync(&adev->delayed_init_work);
+        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+            struct amdgpu_ring *ring = adev->rings[i];
+
+            if (!ring || !ring->sched.thread)
+                continue;
+
+ amdgpu_job_stop_all_jobs_on_sched(&ring->sched);


You need to call drm_sched_stop first before calling this

+        }
+
+        if (adev->mode_info.mode_config_initialized) {
+            if (!amdgpu_device_has_dc_support(adev))
+                drm_helper_force_disable_all(adev->ddev);
+            else
+                drm_atomic_helper_shutdown(adev->ddev);
+        }
+
+        amdgpu_fence_driver_fini(adev);
+        amdgpu_fbdev_fini(adev);
+        /* Try to close drm device to stop applications
+         * from opening dri files for further IO operations.
+         * TODO: This will throw warning as ttm is not
+         * cleaned perperly */
+        drm_dev_fini(dev);


I think user mode applications might still hold reference to the drm 
device through through drm_dev_get either by directly opening
the device file or indirectly through importing DMA buff, if so when 
the last of them terminate 
drm_dev_put->drm_dev_release->...->drm_dev_fini
might get called again causing use after free e.t.c issues. Maybe 
better to call here drm_dev_put then and so drm_dev_fini will get 
called when this
last user client releases his reference.


drm_dev_fini() seems to be cleaner. Problem is  window manager(sway) 
never gets terminated after the AER error and drm files remains active. 
Simple cat on dri files

goes though amdgpu and spits out more errors.



Also a general question - in my work on DPC recovery feature which 
tries to recover after PCIe error - once the PCI error has happened 
MMIO registers become
unaccessible for r/w as the PCI link is dead until after the PCI link 
is reset by the DPC driver (see 
https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html 
section 6.1.4).
Your case is to try and gracefully to close the drm device once fatal 
error happened, didn't you encounter errors or warnings when accessing 
HW registers during any of the operations
above ?


As discussed over chat, it seems aer generated with aer-inject tool just 
triggers kernel PCI error APIs but the device is still active so I 
didn't encounter any errors when accessing HW registers.


Nirmoy



Andrey


+        break;
+    }
+
+    return ret;
+}
+
+static const struct pci_error_handlers amdgpu_err_handler = {
+       .error_detected = amdgpu_pci_err_detected,
+};
+
+
  static struct pci_driver amdgpu_kms_pci_driver = {
      .name = DRIVER_NAME,
      .id_table = pciidlist,
@@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver 
= {
      .remove = amdgpu_pci_remove,
      .shutdown = amdgpu_pci_shutdown,
      .driver.pm = &amdgpu_pm_ops,
+    .err_handler = &amdgpu_err_handler,
  };
  -
-
  static int __init amdgpu_init(void)
  {
      int r;
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx