Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset

I agree. Closing the node is fine, we just need to make sure that reopening it gives us another try.

Shouldn't be too much of a problem; except for a single line, the last version already looked like it should have worked for this.

Christian.

On 24.04.20 at 10:30, Pierre-Eric Pelloux-Prayer wrote:
Running "umr --autodump" currently does the following:
   - open the debugfs autodump file
   - wait for a hang (using poll)
   - dump gfx rings
   - close the file and exit

I don't think adding support for a "Done" message is necessary, but I'd expect to
be able to restart "umr --autodump" and be able to perform another dump.

Pierre-Eric

On 24/04/2020 10:24, Zhao, Jiange wrote:
[AMD Official Use Only - Internal Distribution Only]


Hi,

Of course, considering all the suggestions, I will implement a write callback so that the usermode app can notify the KMD that a dump is finished by sending "Done".

In this way, the usermode app can do multiple dumps without closing the node.

Jiange
------------------------------------------------------------------------
*From:* Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>
*Sent:* Friday, April 24, 2020 3:46 PM
*To:* Zhao, Jiange <Jiange.Zhao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
*Cc:* Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
*Subject:* Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset
Hi Jiange,

FYI I'm working on adding a new "--autodump" command to umr that uses this feature.
This is not yet merged but you can find the code here: https://gitlab.freedesktop.org/pepp/umr/-/tree/autodump

(3) At the same time, considering the use case of this node, I believe that only the first GPU reset is worthy of a dump.
If it's possible, I'd like to be able to do multiple dumps instead of limiting ourselves to only the first one.

Thanks!
Pierre-Eric



(4) I didn't implement a race-condition guard because I believe that this node caters to a cautious super-user, and a single client is enough to do all the work. I can add the logic if you think it is necessary.

Jiange

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Thursday, April 23, 2020 4:53 PM
To: Zhao, Jiange <Jiange.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhao, Jiange <Jiange.Zhao@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset

On 23.04.20 at 09:19, jianzh@xxxxxxx wrote:
From: Jiange Zhao <Jiange.Zhao@xxxxxxx>

When the GPU hits a timeout, it notifies any interested party of an
opportunity to dump info before the actual GPU reset.

A usermode app would open the 'autodump' node under debugfs and
poll() for readable/writable. When a GPU reset is due, amdgpu notifies
the usermode app through the wait_queue_head and gives it 10 minutes to
dump info.

After the usermode app has done its work, this 'autodump' node is closed.
On node closure, amdgpu learns that the dump is done through the
completion that is triggered in release().

There are no write or read callbacks because the necessary info can be
obtained through dmesg and umr. Messages back and forth between the
usermode app and amdgpu are unnecessary.

Signed-off-by: Jiange Zhao <Jiange.Zhao@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  9 +++
    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 85 +++++++++++++++++++++
    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h |  1 +
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +
    4 files changed, 97 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index bc1e0fd71a09..a505b547f242 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -724,6 +724,13 @@ struct amd_powerplay {
        const struct amd_pm_funcs *pp_funcs;
    };
+struct amdgpu_autodump {
+    bool                            registered;
+    struct completion               completed;
Registered and completed seem to have the same meaning.

+    struct dentry                   *dentry;
+    struct wait_queue_head          gpu_hang_wait;
+};
+
    #define AMDGPU_RESET_MAGIC_NUM 64
    #define AMDGPU_MAX_DF_PERFMONS 4
    struct amdgpu_device {
@@ -990,6 +997,8 @@ struct amdgpu_device {
        char                            product_number[16];
        char                            product_name[32];
        char                            serial[16];
+
+    struct amdgpu_autodump          autodump;
    };
    static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 1a4894fa3693..cdd4bf00adee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -74,8 +74,91 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
        return 0;
    }
+int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev)
+{
+#if defined(CONFIG_DEBUG_FS)
+    int ret;
+    unsigned long tmo = 600*HZ;
In general please declare constants first and variables like "i" or "r" last.

+
+    if (!adev->autodump.registered)
+            return 0;
+
+    wake_up_interruptible(&adev->autodump.gpu_hang_wait);
+
+    ret = wait_for_completion_interruptible_timeout(&adev->autodump.completed, tmo);
This is racy, in other words it can happen that a new client opens up the debugfs file without being signaled but blocks the reset here.

You could use two completion structures to avoid that.

+    if (ret == 0) { /* timed out and the dump tool has not finished its dump */
+            pr_err("autodump: timeout before dump finished, move on to gpu recovery\n");
+            return -ETIMEDOUT;
+    }
+#endif
+    return 0;
+}
+
    #if defined(CONFIG_DEBUG_FS)
+static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file)
+{
+    int ret;
+    struct amdgpu_device *adev;
+
+    ret = simple_open(inode, file);
+    if (ret)
+            return ret;
+
+    adev = file->private_data;
+    if (adev->autodump.registered == true)
+            return -EINVAL;
Probably better to return -EBUSY here. And this is racy and might need a lock, e.g. multiple clients could open the file at the same time.

If we use a struct completion for registered, we could use the spinlock of that one for protection here.

+
+    adev->autodump.registered = true;
You also need to reset the completion structure here otherwise only the first GPU reset would work with this.

+
+    return 0;
+}
+
+static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file)
+{
+    struct amdgpu_device *adev = file->private_data;
+
+    complete(&adev->autodump.completed);
+    adev->autodump.registered = false;
+
+    return 0;
+}
+
+unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table)
+{
+    struct amdgpu_device *adev = file->private_data;
+
+    poll_wait(file, &adev->autodump.gpu_hang_wait, poll_table);
+
+    if (adev->in_gpu_reset)
+            return POLLIN | POLLRDNORM | POLLWRNORM;
+
+    return 0;
+}
+
+static const struct file_operations autodump_debug_fops = {
+    .owner = THIS_MODULE,
+    .open = amdgpu_debugfs_autodump_open,
+    .poll = amdgpu_debugfs_autodump_poll,
+    .release = amdgpu_debugfs_autodump_release,
+};
+
+static int amdgpu_debugfs_autodump_init(struct amdgpu_device *adev)
+{
+    struct dentry *entry;
+
+    init_completion(&adev->autodump.completed);
+    init_waitqueue_head(&adev->autodump.gpu_hang_wait);
+    adev->autodump.registered = false;
+
+    entry = debugfs_create_file("autodump", 0600,
+                    adev->ddev->primary->debugfs_root,
+                    adev, &autodump_debug_fops);
+    adev->autodump.dentry = entry;
+
+    return 0;
+}
+
    /**
     * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes
     *
@@ -1434,6 +1517,8 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
        amdgpu_ras_debugfs_create_all(adev);
+    amdgpu_debugfs_autodump_init(adev);
+
        return amdgpu_debugfs_add_files(adev, amdgpu_debugfs_list,
                                        ARRAY_SIZE(amdgpu_debugfs_list));
    }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
index de12d1101526..9428940a696d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
@@ -40,3 +40,4 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
    int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
    int amdgpu_debugfs_firmware_init(struct amdgpu_device *adev);
    int amdgpu_debugfs_gem_init(struct amdgpu_device *adev);
+int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3d601d5dd5af..44e54ea7af0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3915,6 +3915,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
        int i, r = 0;
        bool need_full_reset  = *need_full_reset_arg;
+    amdgpu_debugfs_wait_dump(adev);
+
        /* block all schedulers and reset given job's ring */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                struct amdgpu_ring *ring = adev->rings[i];
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
