Re: [PATCH v8 1/2] drm/amdgpu: add debugfs for reset registers list

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 21.02.22 um 11:06 schrieb Somalapuram, Amaranath:
[AMD Official Use Only]


On 2/21/2022 2:45 PM, Christian König wrote:

Am 21.02.22 um 08:15 schrieb Somalapuram Amaranath:
List of register populated for dump collection during the GPU reset.

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h         |   4 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 114 ++++++++++++++++++++
   2 files changed, 118 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b85b67a88a3d..6e35f2c4c869 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1097,6 +1097,10 @@ struct amdgpu_device {
         struct amdgpu_reset_control     *reset_cntl;
       uint32_t ip_versions[HW_ID_MAX][HWIP_MAX_INSTANCE];
+
+    /* reset dump register */
+    uint32_t            *reset_dump_reg_list;
+    int                             num_regs;
   };
     static inline struct amdgpu_device *drm_to_adev(struct drm_device
*ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 164d6a9e9fbb..14ad9610f805 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1609,6 +1609,118 @@ DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
   DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL,
               amdgpu_debugfs_sclk_set, "%llu\n");
   +static ssize_t amdgpu_reset_dump_register_list_read(struct file *f,
+                char __user *buf, size_t size, loff_t *pos)
+{
+    struct amdgpu_device *adev = (struct amdgpu_device
*)file_inode(f)->i_private;
+    char reg_offset[11];
+    uint32_t num_regs;
+    int i, ret, len = 0;
+
+    if (*pos)
+        return 0;
+
+    ret = down_read_killable(&adev->reset_sem);
+
+    if (ret)
+        return ret;
+
+    num_regs = adev->num_regs;
+
+    up_read(&adev->reset_sem);
+
+    if (num_regs == 0)
+        return 0;
I think we should drop that cause it just avoids the final \n.

ok.
+
+    for (i = 0; i < num_regs; i++) {
That's pretty close, but one problem is still that it is possible that
the number of register becomes much smaller while this loop runs.

Try it like this instead:

down_read_killable(...)
for (i = 0; i < adev->num_regs; ++i) {
     sprintf(...)
     up_read(...);

     copy_to_user(

     down_read_killable(...)
}
up_read().

I created local num_regs to avoid lock ousted the loop. I guess you me
to remove ?

so we can hold up_read inside the loop ?

Yes to both. See num_regs and the pointer always needs to be consistent.

In other words you need to read both while in the same instance of the critical section:

down..
num_regs =...
up
...
down
x = array[num_regs]
up

That above is illegal since you read num_regs and the array in two different critical sections.

But when you do:
down..
for(i=0; i < num_regs; ++i) {
    x = array[i]
    up..
    ....
    down..
}
up...

You still have multiple critical sections, but the read of both num_regs and the array happens in the same one and because of that this is legal.

+
+        ret = down_read_killable(&adev->reset_sem);
+
+        if (ret)
+            return ret;
+
+        sprintf(reg_offset, "0x%x ", adev->reset_dump_reg_list[i]);
+
+        up_read(&adev->reset_sem);
+
+        ret = copy_to_user(buf + len, reg_offset, strlen(reg_offset));
+
+        if (ret)
+            return -EFAULT;
+
+        len += strlen(reg_offset);
+    }
+
+    ret = copy_to_user(buf + len, "\n", 1);
+
+    if (ret)
+        return -EFAULT;
+
+    len++;
+    *pos += len;
+
+    return len;
+}
+
+static ssize_t amdgpu_reset_dump_register_list_write(struct file *f,
+            const char __user *buf, size_t size, loff_t *pos)
+{
+    struct amdgpu_device *adev = (struct amdgpu_device
*)file_inode(f)->i_private;
+    char *reg_offset, *reg, reg_temp[11];
+    uint32_t *tmp_list;
+    int ret, i = 0, len = 0;
+
+    do {
+        reg_offset = reg_temp;
+        memset(reg_offset, 0, 11);
+        ret = copy_from_user(reg_offset, buf + len, min(11,
((int)size-len)));
+
+        if (ret) {
+            kfree(tmp_list);
+            return -EFAULT;
+        }
+
+        reg = strsep(&reg_offset, " ");
+        tmp_list = krealloc_array(tmp_list,
+                    1, sizeof(uint32_t), GFP_KERNEL);
+        ret = kstrtouint(reg, 16, &tmp_list[i]);
+
+        if (ret) {
+            kfree(tmp_list);
+            return -EFAULT;
+        }
+
+        len += strlen(reg) + 1;
+        i++;
+
+    } while (len < size);
+
+    ret = down_read_killable(&adev->reset_sem);
+
+    if (ret) {
+        kfree(tmp_list);
+        return ret;
+    }
+
+    kfree(adev->reset_dump_reg_list);
+
+    swap(adev->reset_dump_reg_list, tmp_list);
Just an assignment is sufficient here if you do the kfree before since
tmp_list isn't used after that.
This is required. what happens when the function is called for the
second time (the old tmp_list will never be freed)

tmp_list is also freed as its can return from the middle of the loop and
tmp_list  will never be freed.

The swap macro doesn't frees anything, a call to swap(a, b) just exchanges a and b. If b isn't used after that it's the same as a=b.

What you can do and is usually considered rather elegant is something like this:

r = down_write_killable(..)
if (r)
    goto error_free;

swap(adev->regs, tmp);
adev->num_regs = i;
up_write();

error_free:
kfree(tmp);
return r;

Since swap() exchanges the old array with the new one the following kfree() will always be correct. In other words in the error case it frees the temporary one and in the good case it frees the replaced one.


+    adev->num_regs = i;
+
+    up_read(&adev->reset_sem);
This should be a down_write_killable() and up_write() to avoid
concurrent reads.
I tried but some time it was deadlock while testing. (some on in driver
holding reset_sem).

Let me test it once again.

Try to enable CONFIG_LOCKDEP (under kernel hacking) that usually yields a nice backtrace into the system log when you do something wrong.

Apart from that the write function now looks clean.

do we need down_read_killable in the below function (second patch) ?


+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+	uint32_t reg_value;
+	int i;
+
+	for (i = 0; i < adev->num_regs; i++) {
+		reg_value = RREG32(adev->reset_dump_reg_list[i]);
+		trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list[i], reg_value);
+	}
+
+	return 0;
+}

I don't think so since that function should be called during reset with the reset_sem held.

What you can do is to add a lockdep_assert_held(adev->reset_sem); which will give you a nice warning when somebody calls this function without holding the lock.

Regards,
Christian.


Regards,
Christian.

+
+    return size;
+}
+
+
+
+static const struct file_operations amdgpu_reset_dump_register_list = {
+    .owner = THIS_MODULE,
+    .read = amdgpu_reset_dump_register_list_read,
+    .write = amdgpu_reset_dump_register_list_write,
+    .llseek = default_llseek
+};
+
   int amdgpu_debugfs_init(struct amdgpu_device *adev)
   {
       struct dentry *root = adev_to_drm(adev)->primary->debugfs_root;
@@ -1672,6 +1784,8 @@ int amdgpu_debugfs_init(struct amdgpu_device
*adev)
                   &amdgpu_debugfs_test_ib_fops);
       debugfs_create_file("amdgpu_vm_info", 0444, root, adev,
                   &amdgpu_debugfs_vm_info_fops);
+    debugfs_create_file("amdgpu_reset_dump_register_list", 0644,
root, adev,
+                &amdgpu_reset_dump_register_list);
         adev->debugfs_vbios_blob.data = adev->bios;
       adev->debugfs_vbios_blob.size = adev->bios_size;
>




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux