On Thu, Jun 15, 2023 at 9:47 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >> Uh, that's very dangerous what you do here and wouldn't work in a
> >> whole bunch of cases.
> > Please elaborate: *what* case doesn't work?
>
> The core memory management can wait at any time for the GPU reset to
> finish.
>
> If we set the timeout to infinity we risk just deadlocking the kernel.

Okay, thanks. I may have seen some aspect of this before, in cases where
a GPU reset failed and left processes in an unkillable state.

I'll be honest: I've seen my fair share of exotic GPU hangs, and, to put
it mildly, I'm not impressed by the kernel's handling of them. Obviously
you know much more about the intricacies of kernel memory management
than I do, but the fact that processes can end up in an unkillable state
for *any* GPU-related reason feels to me like the result of a bad design
decision somewhere.

But anyway, I'm not even asking you to fix those problems. All I'm
asking is that you let me do *my* job, part of which is to help prevent
GPU hangs from happening in the first place. For that, I need useful
debugging facilities -- and so do others.

> > Again, being able to disable GPU recovery is a crucial debugging
> > feature. We need to be able to inspect the live state of hung
> > shaders, and we need to be able to single-step through shaders. All
> > of that requires disabling GPU recovery.
>
> Yeah, I'm perfectly aware of that. The problem is this is just *not*
> supported on Linux for graphics shaders.
>
> What you can do is to run the shader with something like CWSR enabled
> (or a yet-to-be-invented graphics equivalent). Since we are debugging
> the shader anyway that should be possible, I think.
>
> > Forcing people to reboot just to be able to disable GPU recovery for
> > debugging is developer-hostile.
>
> Well, I think you misunderstood me. The suggestion was even to force
> them to re-compile the kernel driver to disable GPU recovery.
>
> Disabling GPU recovery is *not* something you can do and expect the
> system to be stable.
>
> The only case we can do that is when we attach a JTAG debugger in an
> AMD lab.

You're being *completely* unreasonable here. Even Windows(!) allows
disabling GPU recovery at runtime from software, and Windows is usually
far more developer-hostile than Linux in these things.

Seriously, this level of hostility toward developers coming from you is
not okay.

Yes, it's a tool that has sharp edges. That is perfectly well
understood. If we need to add warning labels, then so be it. And if the
details of *how* to change the timeout or disable GPU recovery at
runtime should be changed, that too is fine. But it's an important tool.

Can we please just move forward on this in a pragmatic fashion?

Thanks,
Nicolai

>
> Regards,
> Christian.
> >
> > So again, if there really are cases where this "doesn't work" (and
> > those cases aren't just that your desktop will freeze -- that part is
> > intentional), then let's talk through it and see how to address them.
> >
> > Thanks,
> > Nicolai
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@xxxxxxx>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
> >>>   1 file changed, 31 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index dc474b809604..32d223daa789 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,
> >>>
> >>>  	return result;
> >>>  }
> >>>
> >>>  static const struct file_operations amdgpu_debugfs_ring_fops = {
> >>>  	.owner = THIS_MODULE,
> >>>  	.read = amdgpu_debugfs_ring_read,
> >>>  	.llseek = default_llseek
> >>>  };
> >>>
> >>> +static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
> >>> +	struct amdgpu_ring *ring = data;
> >>> +
> >>> +	if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
> >>> +		*val = 0;
> >>> +	else
> >>> +		*val = jiffies_to_msecs(ring->sched.timeout);
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
> >>> +	struct amdgpu_ring *ring = data;
> >>> +
> >>> +	if (val == 0)
> >>> +		ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> >>> +	else
> >>> +		ring->sched.timeout = msecs_to_jiffies(val);
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
> >>> +			 amdgpu_debugfs_timeout_ring_get,
> >>> +			 amdgpu_debugfs_timeout_ring_set,
> >>> +			 "%llu\n");
> >>> +
> >>>  #endif
> >>>
> >>>  void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
> >>>  			      struct amdgpu_ring *ring)
> >>>  {
> >>>  #if defined(CONFIG_DEBUG_FS)
> >>>  	struct drm_minor *minor = adev_to_drm(adev)->primary;
> >>>  	struct dentry *root = minor->debugfs_root;
> >>> -	char name[32];
> >>> +	char name[40];
> >>>
> >>>  	sprintf(name, "amdgpu_ring_%s", ring->name);
> >>>  	debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
> >>>  				 &amdgpu_debugfs_ring_fops,
> >>>  				 ring->ring_size + 12);
> >>>
> >>> +	sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
> >>> +	debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
> >>> +			    &amdgpu_debugfs_timeout_ring_fops);
> >>>  #endif
> >>>  }
> >>>
> >>>  /**
> >>>   * amdgpu_ring_test_helper - tests ring and set sched readiness status
> >>>   *
> >>>   * @ring: ring to try the recovery on
> >>>   *
> >>>   * Tests ring and set sched readiness status
> >>>   *
> >
>
--
Learn what the world really is like, but never forget how it should be.
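P.S.: Purely for illustration, here is a minimal userspace sketch of how
a shader-debugging tool could flip the proposed knob at runtime. The
debugfs mount point, the DRM minor number (0) and the ring name
("gfx_0.0.0") are assumptions on my side -- they vary per system and
ASIC, so a real tool would have to discover them first:

    /*
     * Hypothetical usage sketch (not part of the patch): disable the GPU
     * scheduler timeout for one ring via the proposed debugfs file so a
     * hung shader stays resident for inspection.
     */
    #include <stdio.h>

    int main(void)
    {
    	/* Assumed path; the actual minor and ring name differ per system. */
    	const char *path =
    		"/sys/kernel/debug/dri/0/amdgpu_timeout_ring_gfx_0.0.0";
    	FILE *f = fopen(path, "w");

    	if (!f) {
    		perror("fopen");
    		return 1;
    	}

    	/*
    	 * Per the patch above, writing 0 maps to MAX_SCHEDULE_TIMEOUT,
    	 * i.e. the scheduler never times out jobs on this ring; any
    	 * other value is taken as a timeout in milliseconds.
    	 */
    	fprintf(f, "0\n");
    	fclose(f);
    	return 0;
    }

Writing a non-zero value restores a finite timeout (in milliseconds), so
the same file also covers the "just give me a longer timeout while I'm
debugging" case, and reading it back reports the current setting.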