Re: Fw: [PATCH] drm/amdgpu: add amdgpu_timeout_ring_* file to debugfs

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Thu, 15 Jun 2023 11:14:28 +0200

Am 15.06.23 um 10:55 schrieb Nicolai Hähnle:
On Thu, Jun 15, 2023 at 9:47 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Uh, that's very dangerous what you do here and wouldn't work in a whole
bunch of cases.
Please elaborate: *what* case doesn't work?
The core memory management can wait at any time for the GPU reset to finish.

If we set the timeout to infinity we risk just deadlocking the kernel.
Okay, thanks. I may have seen some aspect of this before in cases
where GPU reset failed and left processes in an unkillable state.

I'll be honest, I've seen my fair share of exotic GPU hangs and to put
it mildly I'm not impressed by the kernel's handling of them.

Yeah, and I completely agree with you. The whole situation around that 
is just horrible.

Obviously you know much more about the intricacies of kernel memory
management than I do, but the fact that processes can end up in an
unkillable state for *any* GPU-related reason feels to me like the
result of a bad design decision somewhere.

But anyway, I'm not even asking you to fix those problems. All I'm
asking you is to let me do *my* job, part of which is to help prevent
GPU hangs from happening in the first place. For that, I need useful
debugging facilities -- and so do others.


Again, being able to disable GPU recovery is a crucial debugging
feature. We need to be able to inspect the live state of hung shaders,
and we need to be able to single-step through shaders. All of that
requires disabling GPU recovery.
Yeah, I'm perfectly aware of that. The problem is this is just *not*
supported on Linux for graphics shaders.

What you can do is to run the shader with something like CWSR enabled
(or an to be invented graphics equivalent). Since we are debugging the
shader anyway that should be possible I think.

Forcing people to reboot just to be able to disable GPU recovery for
debugging is developer hostile.
Well, I think you misunderstood me. The suggestion was even to force
them to re-compile the kernel driver to disable GPU recovery.

Disabling GPU recovery is *not* something you can do and expect the
system to be stable.

The only case we can do that is when we attach a JTAG debugger in an AMD
lab.
You're being *completely* unreasonable here. Even Windows(!) allows
disabling GPU recovery at runtime from software, and Windows is
usually far more developer hostile than Linux in these things.
Seriously, this level of hostility against developers coming from you
is not okay.

Well, I'm not hostile against developers, but just realistic that this 
will lead to even more problems.

And I rather avoid problems with end-users than with developers because 
the later are usually the more skilled people.

As far as I can see that Windows allowed to disable GPU recovery was 
actually the source of the problem and is absolutely no argument to 
repeat the same mistake on Linux again.

Yes, it's a tool that has sharp edges. That is perfectly well
understood. If we need to add warning labels then so be it. And if the
details of *how* to change the timeout or disable GPU recovery at
runtime should be changed, that too is fine. But it's an important
tool. Can we please just move forward on this in a pragmatic fashion?

Yes and because of this I'm just rejecting this approach here.

Rebooting to disable GPU reset is perfectly fine and reasonable to do 
for a developer.

As I said the requirement I have for the other extreme is to make it a 
compile time only option and I'm trying to avoid that as well.

Regards,
Christian.


Thanks,
Nicolai


Regards,
Christian.

So again, if there really are cases where this "doesn't work" (and
those cases aren't just that your desktop will freeze -- that part is
intentional), then let's talk through it and see how to address them.

Thanks,
Nicolai


Regards,
Christian.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
    1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index dc474b809604..32d223daa789 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,

         return result;
    }

    static const struct file_operations amdgpu_debugfs_ring_fops = {
         .owner = THIS_MODULE,
         .read = amdgpu_debugfs_ring_read,
         .llseek = default_llseek
    };

+static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
+     struct amdgpu_ring *ring = data;
+
+     if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
+             *val = 0;
+     else
+             *val = jiffies_to_msecs(ring->sched.timeout);
+
+     return 0;
+}
+
+static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
+     struct amdgpu_ring *ring = data;
+
+     if (val == 0)
+             ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
+     else
+             ring->sched.timeout = msecs_to_jiffies(val);
+
+     return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
+                      amdgpu_debugfs_timeout_ring_get,
+                      amdgpu_debugfs_timeout_ring_set,
+                      "%llu\n");
+
    #endif

    void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
                               struct amdgpu_ring *ring)
    {
    #if defined(CONFIG_DEBUG_FS)
         struct drm_minor *minor = adev_to_drm(adev)->primary;
         struct dentry *root = minor->debugfs_root;
-     char name[32];
+     char name[40];

         sprintf(name, "amdgpu_ring_%s", ring->name);
         debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
                                  &amdgpu_debugfs_ring_fops,
                                  ring->ring_size + 12);

+     sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
+     debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
+                         &amdgpu_debugfs_timeout_ring_fops);
    #endif
    }

    /**
     * amdgpu_ring_test_helper - tests ring and set sched readiness status
     *
     * @ring: ring to try the recovery on
     *
     * Tests ring and set sched readiness status
     *