Re: Fw: [PATCH] drm/amdgpu: add amdgpu_timeout_ring_* file to debugfs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 14.06.23 um 21:20 schrieb Nicolai Hähnle:
Hi Christian,

Report the per-ring timeout in milliseconds and allow users to adjust
the timeout dynamically. This can be useful for debugging, e.g. to more
easily test whether a submission genuinely hangs or is just taking very
long, and to temporarily disable GPU recovery so that shader problems
can be examined in detail, including single-stepping through shader
code.

It feels a bit questionable to access ring->sched.timeout without any
locking -- under a C++ memory model it would technically be undefined
behavior. But it's not like a lot can go wrong here in practice, and
it's not clear to me what locking or atomics, if any, should be used.
Uh, that's very dangerous what you do here and wouldn't work in a whole
bunch of cases.
Please elaborate: *what* case doesn't work?

The core memory management can wait at any time for the GPU reset to finish.

If we set the timeout to infinity we risk just deadlocking the kernel.

See here as well: https://lpc.events/event/11/contributions/1115/



First of all GPU recovery is part of normal operation and necessary for
system stability. So disabling GPU recovery is actually not a good idea
in the first place.
That's a complete non-argument because the whole point of this is that
it is a debugging feature. You're using this when the system as a
whole (most likely a UMD component) is already broken in some way.
Putting this in debugfs is not an accident.


We already discussed that we probably need to taint the kernel if we do
so to indicate in crash logs that the system is not considered stable
any more. The problem was only that there wasn't an agreement on how to
do this.
I'd be happy to add kernel tainting if you tell me how.


Since this here now makes it even easier to disable GPU recovery it's
probably not the right approach.
Again, being able to disable GPU recovery is a crucial debugging
feature. We need to be able to inspect the live state of hung shaders,
and we need to be able to single-step through shaders. All of that
requires disabling GPU recovery.

Yeah, I'm perfectly aware of that. The problem is this is just *not* supported on Linux for graphics shaders.

What you can do is to run the shader with something like CWSR enabled (or an to be invented graphics equivalent). Since we are debugging the shader anyway that should be possible I think.

Forcing people to reboot just to be able to disable GPU recovery for
debugging is developer hostile.

Well, I think you misunderstood me. The suggestion was even to force them to re-compile the kernel driver to disable GPU recovery.

Disabling GPU recovery is *not* something you can do and expect the system to be stable.

The only case we can do that is when we attach a JTAG debugger in an AMD lab.

Regards,
Christian.


So again, if there really are cases where this "doesn't work" (and
those cases aren't just that your desktop will freeze -- that part is
intentional), then let's talk through it and see how to address them.

Thanks,
Nicolai


Regards,
Christian.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
   1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index dc474b809604..32d223daa789 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,

        return result;
   }

   static const struct file_operations amdgpu_debugfs_ring_fops = {
        .owner = THIS_MODULE,
        .read = amdgpu_debugfs_ring_read,
        .llseek = default_llseek
   };

+static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
+     struct amdgpu_ring *ring = data;
+
+     if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
+             *val = 0;
+     else
+             *val = jiffies_to_msecs(ring->sched.timeout);
+
+     return 0;
+}
+
+static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
+     struct amdgpu_ring *ring = data;
+
+     if (val == 0)
+             ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
+     else
+             ring->sched.timeout = msecs_to_jiffies(val);
+
+     return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
+                      amdgpu_debugfs_timeout_ring_get,
+                      amdgpu_debugfs_timeout_ring_set,
+                      "%llu\n");
+
   #endif

   void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
                              struct amdgpu_ring *ring)
   {
   #if defined(CONFIG_DEBUG_FS)
        struct drm_minor *minor = adev_to_drm(adev)->primary;
        struct dentry *root = minor->debugfs_root;
-     char name[32];
+     char name[40];

        sprintf(name, "amdgpu_ring_%s", ring->name);
        debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
                                 &amdgpu_debugfs_ring_fops,
                                 ring->ring_size + 12);

+     sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
+     debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
+                         &amdgpu_debugfs_timeout_ring_fops);
   #endif
   }

   /**
    * amdgpu_ring_test_helper - tests ring and set sched readiness status
    *
    * @ring: ring to try the recovery on
    *
    * Tests ring and set sched readiness status
    *





[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux