Re: Fw: [PATCH] drm/amdgpu: add amdgpu_timeout_ring_* file to debugfs

On Thu, Jun 15, 2023 at 5:14 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Am 15.06.23 um 10:55 schrieb Nicolai Hähnle:
> > On Thu, Jun 15, 2023 at 9:47 AM Christian König
> > <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >>>> Uh, what you're doing here is very dangerous and wouldn't work in a
> >>>> whole bunch of cases.
> >>> Please elaborate: *what* case doesn't work?
> >> The core memory management can wait at any time for the GPU reset to finish.
> >>
> >> If we set the timeout to infinity we risk just deadlocking the kernel.
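
(For concreteness, the deadlock described above is roughly the
following shape -- a minimal illustrative sketch, not actual amdgpu
code; the eviction helper name is made up.)

#include <linux/dma-fence.h>

/*
 * Hypothetical eviction path: memory management needs to evict a
 * buffer still in use by a hung job, so it waits on the job's fence.
 * That fence only signals once the scheduler's timeout fires and the
 * GPU is reset.  With the timeout set to MAX_SCHEDULE_TIMEOUT the
 * timedout_job callback never runs, the fence never signals, and this
 * wait -- and everything queued up behind the reclaim path -- blocks
 * forever.
 */
static int example_evict_wait(struct dma_fence *last_use)
{
        long r = dma_fence_wait(last_use, false);

        return r < 0 ? r : 0;
}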
> > Okay, thanks. I may have seen some aspect of this before in cases
> > where GPU reset failed and left processes in an unkillable state.
> >
> > I'll be honest, I've seen my fair share of exotic GPU hangs, and to
> > put it mildly, I'm not impressed by the kernel's handling of them.
>
> Yeah, and I completely agree with you. The whole situation around that
> is just horrible.
>
> > Obviously you know much more about the intricacies of kernel memory
> > management than I do, but the fact that processes can end up in an
> > unkillable state for *any* GPU-related reason feels to me like the
> > result of a bad design decision somewhere.
> >
> > But anyway, I'm not even asking you to fix those problems. All I'm
> > asking you is to let me do *my* job, part of which is to help prevent
> > GPU hangs from happening in the first place. For that, I need useful
> > debugging facilities -- and so do others.
> >
> >
> >>> Again, being able to disable GPU recovery is a crucial debugging
> >>> feature. We need to be able to inspect the live state of hung shaders,
> >>> and we need to be able to single-step through shaders. All of that
> >>> requires disabling GPU recovery.
> >> Yeah, I'm perfectly aware of that. The problem is that this is just
> >> *not* supported on Linux for graphics shaders.
> >>
> >> What you can do is run the shader with something like CWSR enabled
> >> (or a yet-to-be-invented graphics equivalent). Since we are debugging
> >> the shader anyway, that should be possible, I think.
> >>
> >>> Forcing people to reboot just to be able to disable GPU recovery for
> >>> debugging is developer-hostile.
> >> Well, I think you misunderstood me. The suggestion was even to force
> >> them to re-compile the kernel driver to disable GPU recovery.
> >>
> >> Disabling GPU recovery is *not* something you can do and expect the
> >> system to be stable.
> >>
> >> The only case we can do that is when we attach a JTAG debugger in an AMD
> >> lab.
> > You're being *completely* unreasonable here. Even Windows(!) allows
> > disabling GPU recovery at runtime from software, and Windows is
> > usually far more developer-hostile than Linux in these things.
> > Seriously, this level of hostility against developers coming from you
> > is not okay.
>
> Well, I'm not hostile toward developers, just realistic that this will
> lead to even more problems.
>
> And I would rather avoid problems with end-users than with developers,
> because the latter are usually the more skilled people.
>
> As far as I can see, the fact that Windows allowed disabling GPU
> recovery was actually the source of the problem, and it is absolutely
> no argument for repeating the same mistake on Linux.
>
> > Yes, it's a tool that has sharp edges. That is perfectly well
> > understood. If we need to add warning labels then so be it. And if the
> > details of *how* to change the timeout or disable GPU recovery at
> > runtime should be changed, that too is fine. But it's an important
> > tool. Can we please just move forward on this in a pragmatic fashion?
>
> Yes, and because of this I'm just rejecting this approach here.
>
> Rebooting to disable GPU reset is perfectly fine and reasonable to do
> for a developer.
>
> As I said, the requirement I have from the other extreme is to make it
> a compile-time-only option, and I'm trying to avoid that as well.

I have to agree with Nicolai here.  I think it's reasonable to be able
to enable a debugging mode at runtime if you are root.  There are
plenty of other dangerous things you can do via debugfs.  This is the
least of our worries.  Pretty much the first thing anyone does when
trying to debug a hang is to disable reset.  This seems like a
nice compromise.
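
Just to make the workflow concrete (the debugfs path and the ring name
below are illustrative -- the real path depends on the DRM minor and
the ASIC's ring names), disabling recovery for a single ring at runtime
would look roughly like this from userspace:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write "0" to the per-ring timeout file added by this patch.  The
 * setter maps 0 to MAX_SCHEDULE_TIMEOUT, so the ring never times out
 * and recovery is never triggered for it.  Needs root, since debugfs
 * is root-only by default and the file is created with S_IWUSR.
 */
int main(void)
{
        const char *path =
                "/sys/kernel/debug/dri/0/amdgpu_timeout_ring_gfx_0.0.0";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, "0", 1) != 1)
                perror("write");
        close(fd);
        return 0;
}

In practice that is of course just an
"echo 0 > /sys/kernel/debug/dri/<minor>/amdgpu_timeout_ring_<ring>"
from a root shell; writing a non-zero value sets that ring's timeout in
milliseconds again.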

Alex

>
> Regards,
> Christian.
>
> >
> > Thanks,
> > Nicolai
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> So again, if there really are cases where this "doesn't work" (and
> >>> those cases aren't just that your desktop will freeze -- that part is
> >>> intentional), then let's talk through it and see how to address them.
> >>>
> >>> Thanks,
> >>> Nicolai
> >>>
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@xxxxxxx>
> >>>>> ---
> >>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
> >>>>>     1 file changed, 31 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>>>> index dc474b809604..32d223daa789 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>>>> @@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,
> >>>>>
> >>>>>          return result;
> >>>>>     }
> >>>>>
> >>>>>     static const struct file_operations amdgpu_debugfs_ring_fops = {
> >>>>>          .owner = THIS_MODULE,
> >>>>>          .read = amdgpu_debugfs_ring_read,
> >>>>>          .llseek = default_llseek
> >>>>>     };
> >>>>>
> >>>>> +static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val)
> >>>>> +{
> >>>>> +     struct amdgpu_ring *ring = data;
> >>>>> +
> >>>>> +     if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
> >>>>> +             *val = 0;
> >>>>> +     else
> >>>>> +             *val = jiffies_to_msecs(ring->sched.timeout);
> >>>>> +
> >>>>> +     return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val)
> >>>>> +{
> >>>>> +     struct amdgpu_ring *ring = data;
> >>>>> +
> >>>>> +     if (val == 0)
> >>>>> +             ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> >>>>> +     else
> >>>>> +             ring->sched.timeout = msecs_to_jiffies(val);
> >>>>> +
> >>>>> +     return 0;
> >>>>> +}
> >>>>> +
> >>>>> +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
> >>>>> +                      amdgpu_debugfs_timeout_ring_get,
> >>>>> +                      amdgpu_debugfs_timeout_ring_set,
> >>>>> +                      "%llu\n");
> >>>>> +
> >>>>>     #endif
> >>>>>
> >>>>>     void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
> >>>>>                                struct amdgpu_ring *ring)
> >>>>>     {
> >>>>>     #if defined(CONFIG_DEBUG_FS)
> >>>>>          struct drm_minor *minor = adev_to_drm(adev)->primary;
> >>>>>          struct dentry *root = minor->debugfs_root;
> >>>>> -     char name[32];
> >>>>> +     char name[40];
> >>>>>
> >>>>>          sprintf(name, "amdgpu_ring_%s", ring->name);
> >>>>>          debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
> >>>>>                                   &amdgpu_debugfs_ring_fops,
> >>>>>                                   ring->ring_size + 12);
> >>>>>
> >>>>> +     sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
> >>>>> +     debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
> >>>>> +                         &amdgpu_debugfs_timeout_ring_fops);
> >>>>>     #endif
> >>>>>     }
> >>>>>
> >>>>>     /**
> >>>>>      * amdgpu_ring_test_helper - tests ring and set sched readiness status
> >>>>>      *
> >>>>>      * @ring: ring to try the recovery on
> >>>>>      *
> >>>>>      * Tests ring and set sched readiness status
> >>>>>      *
> >
>



