Hi,
Of course. Considering all the suggestions, I will implement a write callback so that the usermode app can notify KMD that a dump is finished by writing "Done".
This way, the usermode app can do multiple dumps without closing the node.
Jiange
From: Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>
Sent: Friday, April 24, 2020 3:46 PM
To: Zhao, Jiange <Jiange.Zhao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset

Hi Jiange,
FYI I'm working on adding a new "--autodump" command to umr that uses this feature. This is not yet merged, but you can find the code here:
https://gitlab.freedesktop.org/pepp/umr/-/tree/autodump

> (3) At the same time, considering the use case of this node, I believe that only the first GPU reset is worthy of a dump.

If possible, I'd like to be able to do multiple dumps instead of limiting ourselves to only the first one.

Thanks!
Pierre-Eric

> (4) I didn't implement a race condition guard because I believe that this node caters to a cautious super-user, and a single client is enough to do all the work. I can add the logic if you think it is necessary.
>
> Jiange
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Sent: Thursday, April 23, 2020 4:53 PM
> To: Zhao, Jiange <Jiange.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhao, Jiange <Jiange.Zhao@xxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset
>
> Am 23.04.20 um 09:19 schrieb jianzh@xxxxxxx:
>> From: Jiange Zhao <Jiange.Zhao@xxxxxxx>
>>
>> When the GPU hits a timeout, notify an interested party of an
>> opportunity to dump info before the actual GPU reset.
>>
>> A usermode app would open the 'autodump' node under debugfs and
>> poll() for readable/writable. When a GPU reset is due, amdgpu would
>> notify the usermode app through wait_queue_head and give it 10
>> minutes to dump info.
>>
>> After the usermode app has done its work, this 'autodump' node is
>> closed. On node closure, amdgpu gets to know the dump is done through
>> the completion that is triggered in release().
>>
>> There is no write or read callback because the necessary info can be
>> obtained through dmesg and umr. Messages back and forth between the
>> usermode app and amdgpu are unnecessary.
>>
>> Signed-off-by: Jiange Zhao <Jiange.Zhao@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  9 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 85 +++++++++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h |  1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +
>>  4 files changed, 97 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index bc1e0fd71a09..a505b547f242 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -724,6 +724,13 @@ struct amd_powerplay {
>>  	const struct amd_pm_funcs *pp_funcs;
>>  };
>>  
>> +struct amdgpu_autodump {
>> +	bool registered;
>> +	struct completion completed;
>
> Registered and completed seem to have the same meaning.
>
>> +	struct dentry *dentry;
>> +	struct wait_queue_head gpu_hang_wait;
>> +};
>> +
>>  #define AMDGPU_RESET_MAGIC_NUM 64
>>  #define AMDGPU_MAX_DF_PERFMONS 4
>>  struct amdgpu_device {
>> @@ -990,6 +997,8 @@ struct amdgpu_device {
>>  	char	product_number[16];
>>  	char	product_name[32];
>>  	char	serial[16];
>> +
>> +	struct amdgpu_autodump	autodump;
>>  };
>>  
>>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 1a4894fa3693..cdd4bf00adee 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -74,8 +74,91 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>>  	return 0;
>>  }
>>  
>> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev)
>> +{
>> +#if defined(CONFIG_DEBUG_FS)
>> +	int ret;
>> +	unsigned long tmo = 600*HZ;
>
> In general please declare constants first and variables like "i" or "r" last.
>
>> +
>> +	if (!adev->autodump.registered)
>> +		return 0;
>> +
>> +	wake_up_interruptible(&adev->autodump.gpu_hang_wait);
>> +
>> +	ret = wait_for_completion_interruptible_timeout(&adev->autodump.completed, tmo);
>
> This is racy. In other words, it can happen that a new client opens up the debugfs file without being signaled but blocks the reset here.
>
> You could use two completion structures to avoid that.
>
>> +	if (ret == 0) { /* timed out and the dump tool still hasn't finished its dump */
>> +		pr_err("autodump: timeout before dump finished, move on to gpu recovery\n");
>> +		return -ETIMEDOUT;
>> +	}
>> +#endif
>> +	return 0;
>> +}
>> +
>>  #if defined(CONFIG_DEBUG_FS)
>>  
>> +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file)
>> +{
>> +	int ret;
>> +	struct amdgpu_device *adev;
>> +
>> +	ret = simple_open(inode, file);
>> +	if (ret)
>> +		return ret;
>> +
>> +	adev = file->private_data;
>> +	if (adev->autodump.registered == true)
>> +		return -EINVAL;
>
> Probably better to return -EBUSY here. And this is racy and might need a lock, e.g. multiple clients could open the file at the same time.
>
> If we use a struct completion for registered we could use the spinlock of that one for protection here.
>
>> +
>> +	adev->autodump.registered = true;
>
> You also need to reset the completion structure here, otherwise only the first GPU reset would work with this.
>
>> +
>> +	return 0;
>> +}
>> +
>> +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file)
>> +{
>> +	struct amdgpu_device *adev = file->private_data;
>> +
>> +	complete(&adev->autodump.completed);
>> +	adev->autodump.registered = false;
>> +
>> +	return 0;
>> +}
>> +
>> +unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table)
>> +{
>> +	struct amdgpu_device *adev = file->private_data;
>> +
>> +	poll_wait(file, &adev->autodump.gpu_hang_wait, poll_table);
>> +
>> +	if (adev->in_gpu_reset)
>> +		return POLLIN | POLLRDNORM | POLLWRNORM;
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct file_operations autodump_debug_fops = {
>> +	.owner = THIS_MODULE,
>> +	.open = amdgpu_debugfs_autodump_open,
>> +	.poll = amdgpu_debugfs_autodump_poll,
>> +	.release = amdgpu_debugfs_autodump_release,
>> +};
>> +
>> +static int amdgpu_debugfs_autodump_init(struct amdgpu_device *adev)
>> +{
>> +	struct dentry *entry;
>> +
>> +	init_completion(&adev->autodump.completed);
>> +	init_waitqueue_head(&adev->autodump.gpu_hang_wait);
>> +	adev->autodump.registered = false;
>> +
>> +	entry = debugfs_create_file("autodump", 0600,
>> +			adev->ddev->primary->debugfs_root,
>> +			adev, &autodump_debug_fops);
>> +	adev->autodump.dentry = entry;
>> +
>> +	return 0;
>> +}
>> +
>>  /**
>>   * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes
>>   *
>> @@ -1434,6 +1517,8 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>>  
>>  	amdgpu_ras_debugfs_create_all(adev);
>>  
>> +	amdgpu_debugfs_autodump_init(adev);
>> +
>>  	return amdgpu_debugfs_add_files(adev, amdgpu_debugfs_list,
>>  					ARRAY_SIZE(amdgpu_debugfs_list));
>>  }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
>> index de12d1101526..9428940a696d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
>> @@ -40,3 +40,4 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>>  int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
>>  int amdgpu_debugfs_firmware_init(struct amdgpu_device *adev);
>>  int amdgpu_debugfs_gem_init(struct amdgpu_device *adev);
>> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 3d601d5dd5af..44e54ea7af0f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3915,6 +3915,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>  	int i, r = 0;
>>  	bool need_full_reset = *need_full_reset_arg;
>>  
>> +	amdgpu_debugfs_wait_dump(adev);
>> +
>>  	/* block all schedulers and reset given job's ring */
>>  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>  		struct amdgpu_ring *ring = adev->rings[i];
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
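The race Christian points out, and the two-completion fix he suggests, can be modeled in userspace. The sketch below is illustrative only (plain C with a mutex/condvar standing in for the kernel's wait/completion machinery; all names are hypothetical, not the amdgpu API): the reset path only waits for a client that registered before the hang was signalled, so a client that opens the node afterwards cannot block the reset.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace model of the autodump handshake.  The two flags hang_seen
 * and dump_done stand in for two separate struct completion objects. */
struct autodump_model {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool registered;   /* a client opened the node before the hang */
	bool hang_seen;    /* "dump requested" completion */
	bool dump_done;    /* "dump finished" completion */
};

/* Reset path: signal the hang, then wait until the pre-registered
 * client reports completion.  A client that registers after this
 * point is simply ignored for this reset cycle. */
static void model_wait_dump(struct autodump_model *m)
{
	pthread_mutex_lock(&m->lock);
	if (m->registered) {
		m->hang_seen = true;
		pthread_cond_broadcast(&m->cond);
		while (!m->dump_done)
			pthread_cond_wait(&m->cond, &m->lock);
		m->dump_done = false;	/* re-arm for the next reset */
	}
	m->hang_seen = false;
	pthread_mutex_unlock(&m->lock);
}

/* Client release(): report the dump as finished. */
static void model_release(struct autodump_model *m)
{
	pthread_mutex_lock(&m->lock);
	m->registered = false;
	m->dump_done = true;
	pthread_cond_broadcast(&m->cond);
	pthread_mutex_unlock(&m->lock);
}
```

Because the "finished" state is re-armed by the waiter itself, a second reset cycle works without reopening logic, which is the multi-dump behavior Pierre-Eric asks for above.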
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx