Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset


 



Yes, an open file descriptor holds a reference to the driver module.

So it shouldn't be possible to unload the driver while it is open.

Christian.

On 23.04.20 at 09:54, Liu, Monk wrote:
Oh, it looks like if the daemon has the node open, KMD never gets a chance to enter the shutdown/unload path, so there is no chance to return "kmd unloading" to the app...

_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Liu, Monk
Sent: Thursday, April 23, 2020 3:52 PM
To: Zhao, Jiange <Jiange.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset

Hi Christian

Do you think we need to kill the daemon app when we unload the KMD? Usually the user needs to close the app first, and only then can the KMD be unloaded.

If we don't want to shut down the daemon app manually, we can send a "KILL" signal to that process, or we can implement "read" and let the app call "read()" to fetch information like:
1) xxx process hang
2) kmd unloading

The daemon can then close() the node if it receives "kmd unloading" instead of doing the dump.
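
A rough userspace sketch of the read() idea above, for illustration only. The patch defines no read callback, so the event codes, the 32-bit encoding, and the helper names autodump_read_event() and autodump_should_dump() are all hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical event codes; the posted patch defines no read() interface. */
enum autodump_event {
	AUTODUMP_EVENT_ERROR = -1,	/* short read or unknown code */
	AUTODUMP_EVENT_HANG = 1,	/* "xxx process hang" */
	AUTODUMP_EVENT_UNLOADING = 2,	/* "kmd unloading" */
};

/* Read one 32-bit event code from the node (assumed encoding). */
static enum autodump_event autodump_read_event(int fd)
{
	uint32_t code;
	ssize_t n;

	assert(fd >= 0);

	n = read(fd, &code, sizeof(code));
	if (n != (ssize_t)sizeof(code))
		return AUTODUMP_EVENT_ERROR;

	switch (code) {
	case AUTODUMP_EVENT_HANG:
	case AUTODUMP_EVENT_UNLOADING:
		return (enum autodump_event)code;
	default:
		return AUTODUMP_EVENT_ERROR;
	}
}

/* Returns 1 if the daemon should dump, 0 if it should only close() the node. */
static int autodump_should_dump(enum autodump_event ev)
{
	return ev == AUTODUMP_EVENT_HANG;
}
```

With something like this, the daemon would dump only on the hang event and simply close() the node on any other event.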

Thanks

_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Zhao, Jiange <Jiange.Zhao@xxxxxxx>
Sent: Thursday, April 23, 2020 3:20 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhao, Jiange <Jiange.Zhao@xxxxxxx>
Subject: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset

From: Jiange Zhao <Jiange.Zhao@xxxxxxx>

When the GPU hits a timeout, amdgpu notifies an interested party so that it has an opportunity to dump info before the actual GPU reset.

A usermode app opens the 'autodump' node under debugfs and poll()s it for readable/writable. When a GPU reset is due, amdgpu notifies the usermode app through a wait_queue_head and gives it 10 minutes to dump info.

After the usermode app has done its work, it closes the 'autodump' node.
On node closure, amdgpu learns that the dump is done through the completion that is triggered in release().

There is no write or read callback because the necessary info can be obtained through dmesg and umr. Messages back and forth between the usermode app and amdgpu are unnecessary.
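
For illustration, the usermode side of the flow described above could look roughly like the sketch below. This is not part of the patch. The helper names autodump_wait() and autodump_cycle() and the dump callback are hypothetical, and the node path is left to the caller (on a typical system it would presumably be something like /sys/kernel/debug/dri/0/autodump, which is an assumption):

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Wait until the opened node reports an event or the timeout expires.
 * Returns 1 on event, 0 on timeout, -1 on error.
 */
static int autodump_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;

	return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * One dump cycle: open the node, wait for a hang notification,
 * run the (hypothetical) dump callback, then close the node.
 * Returns 1 if a dump ran successfully, 0 on timeout, -1 on error.
 */
static int autodump_cycle(const char *path, int timeout_ms, int (*dump)(void))
{
	int fd, ret;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = autodump_wait(fd, timeout_ms);
	if (ret == 1 && dump)
		ret = dump() ? -1 : 1;	/* dump() returns 0 on success */

	/* Closing the node runs release(), which signals "dump done". */
	close(fd);
	return ret;
}
```

A daemon would call autodump_cycle() in a loop with a timeout of -1 (block until notified) and a dump callback that collects dmesg and umr output. Because the node is closed after every cycle, it has to be reopened before the next wait.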

Signed-off-by: Jiange Zhao <Jiange.Zhao@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  9 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 85 +++++++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +
  4 files changed, 97 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index bc1e0fd71a09..a505b547f242 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -724,6 +724,13 @@ struct amd_powerplay {
  	const struct amd_pm_funcs *pp_funcs;
  };
+struct amdgpu_autodump {
+	bool				registered;
+	struct completion		completed;
+	struct dentry			*dentry;
+	struct wait_queue_head		gpu_hang_wait;
+};
+
  #define AMDGPU_RESET_MAGIC_NUM 64
  #define AMDGPU_MAX_DF_PERFMONS 4
  struct amdgpu_device {
@@ -990,6 +997,8 @@ struct amdgpu_device {
  	char				product_number[16];
  	char				product_name[32];
  	char				serial[16];
+
+	struct amdgpu_autodump		autodump;
  };
  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 1a4894fa3693..cdd4bf00adee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -74,8 +74,91 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
  	return 0;
  }
+int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev)
+{
+#if defined(CONFIG_DEBUG_FS)
+	int ret;
+	unsigned long tmo = 600*HZ;
+
+	if (!adev->autodump.registered)
+		return 0;
+
+	wake_up_interruptible(&adev->autodump.gpu_hang_wait);
+
+	ret = wait_for_completion_interruptible_timeout(&adev->autodump.completed, tmo);
+	if (ret == 0) { /* timed out and the dump tool has not finished its dump */
+		pr_err("autodump: timeout before dump finished, move on to gpu recovery\n");
+		return -ETIMEDOUT;
+	}
+#endif
+	return 0;
+}
+
  #if defined(CONFIG_DEBUG_FS)
+static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file)
+{
+	int ret;
+	struct amdgpu_device *adev;
+
+	ret = simple_open(inode, file);
+	if (ret)
+		return ret;
+
+	adev = file->private_data;
+	if (adev->autodump.registered == true)
+		return -EINVAL;
+
+	adev->autodump.registered = true;
+
+	return 0;
+}
+
+static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file)
+{
+	struct amdgpu_device *adev = file->private_data;
+
+	complete(&adev->autodump.completed);
+	adev->autodump.registered = false;
+
+	return 0;
+}
+
+unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table)
+{
+	struct amdgpu_device *adev = file->private_data;
+
+	poll_wait(file, &adev->autodump.gpu_hang_wait, poll_table);
+
+	if (adev->in_gpu_reset)
+		return POLLIN | POLLRDNORM | POLLWRNORM;
+
+	return 0;
+}
+
+static const struct file_operations autodump_debug_fops = {
+	.owner = THIS_MODULE,
+	.open = amdgpu_debugfs_autodump_open,
+	.poll = amdgpu_debugfs_autodump_poll,
+	.release = amdgpu_debugfs_autodump_release,
+};
+
+static int amdgpu_debugfs_autodump_init(struct amdgpu_device *adev)
+{
+	struct dentry *entry;
+
+	init_completion(&adev->autodump.completed);
+	init_waitqueue_head(&adev->autodump.gpu_hang_wait);
+	adev->autodump.registered = false;
+
+	entry = debugfs_create_file("autodump", 0600,
+			adev->ddev->primary->debugfs_root,
+			adev, &autodump_debug_fops);
+	adev->autodump.dentry = entry;
+
+	return 0;
+}
+
  /**
   * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes
   *
@@ -1434,6 +1517,8 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
  	amdgpu_ras_debugfs_create_all(adev);
+	amdgpu_debugfs_autodump_init(adev);
+
  	return amdgpu_debugfs_add_files(adev, amdgpu_debugfs_list,
  					ARRAY_SIZE(amdgpu_debugfs_list));
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
index de12d1101526..9428940a696d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
@@ -40,3 +40,4 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
  int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
  int amdgpu_debugfs_firmware_init(struct amdgpu_device *adev);
  int amdgpu_debugfs_gem_init(struct amdgpu_device *adev);
+int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3d601d5dd5af..44e54ea7af0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3915,6 +3915,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
  	int i, r = 0;
  	bool need_full_reset  = *need_full_reset_arg;
+	amdgpu_debugfs_wait_dump(adev);
+
  	/* block all schedulers and reset given job's ring */
  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  		struct amdgpu_ring *ring = adev->rings[i];
--
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



