On Thu, Jun 15, 2023 at 9:47 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >> Uh, that's very dangerous what you do here and wouldn't work in a
> >> whole bunch of cases.
> > Please elaborate: *what* case doesn't work?
>
> The core memory management can wait at any time for the GPU reset to
> finish.
>
> If we set the timeout to infinity we risk just deadlocking the kernel.

Okay, thanks. I may have seen some aspect of this before, in cases where
a GPU reset failed and left processes in an unkillable state.

I'll be honest: I've seen my fair share of exotic GPU hangs, and, to put
it mildly, I'm not impressed by the kernel's handling of them. Obviously
you know much more about the intricacies of kernel memory management
than I do, but the fact that processes can end up in an unkillable state
for *any* GPU-related reason feels to me like the result of a bad design
decision somewhere.

But anyway, I'm not even asking you to fix those problems. All I'm
asking is that you let me do *my* job, part of which is to help prevent
GPU hangs from happening in the first place. For that, I need useful
debugging facilities -- and so do others.

> > Again, being able to disable GPU recovery is a crucial debugging
> > feature. We need to be able to inspect the live state of hung
> > shaders, and we need to be able to single-step through shaders. All
> > of that requires disabling GPU recovery.
>
> Yeah, I'm perfectly aware of that. The problem is this is just *not*
> supported on Linux for graphics shaders.
>
> What you can do is to run the shader with something like CWSR enabled
> (or a yet-to-be-invented graphics equivalent). Since we are debugging
> the shader anyway that should be possible, I think.
>
> > Forcing people to reboot just to be able to disable GPU recovery for
> > debugging is developer-hostile.
>
> Well, I think you misunderstood me. The suggestion was even to force
> them to re-compile the kernel driver to disable GPU recovery.
>
> Disabling GPU recovery is *not* something you can do and expect the
> system to be stable.
>
> The only case we can do that is when we attach a JTAG debugger in an
> AMD lab.

You're being *completely* unreasonable here. Even Windows(!) allows
disabling GPU recovery at runtime from software, and Windows is usually
far more developer-hostile than Linux in these things.

Seriously, this level of hostility toward developers coming from you is
not okay.

Yes, it's a tool that has sharp edges. That is perfectly well
understood. If we need to add warning labels, then so be it. And if the
details of *how* to change the timeout or disable GPU recovery at
runtime should be changed, that too is fine. But it's an important tool.

Can we please just move forward on this in a pragmatic fashion?

Thanks,
Nicolai

>
> Regards,
> Christian.
> >
> > So again, if there really are cases where this "doesn't work" (and
> > those cases aren't just that your desktop will freeze -- that part is
> > intentional), then let's talk through it and see how to address them.
> >
> > Thanks,
> > Nicolai
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@xxxxxxx>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 32 +++++++++++++++++++++++-
> >>>   1 file changed, 31 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index dc474b809604..32d223daa789 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -471,35 +471,65 @@ static ssize_t amdgpu_debugfs_ring_read(struct file *f, char __user *buf,
> >>>
> >>>  	return result;
> >>>  }
> >>>
> >>>  static const struct file_operations amdgpu_debugfs_ring_fops = {
> >>>  	.owner = THIS_MODULE,
> >>>  	.read = amdgpu_debugfs_ring_read,
> >>>  	.llseek = default_llseek
> >>>  };
> >>>
> >>> +static int amdgpu_debugfs_timeout_ring_get(void *data, u64 *val) {
> >>> +	struct amdgpu_ring *ring = data;
> >>> +
> >>> +	if (ring->sched.timeout == MAX_SCHEDULE_TIMEOUT)
> >>> +		*val = 0;
> >>> +	else
> >>> +		*val = jiffies_to_msecs(ring->sched.timeout);
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static int amdgpu_debugfs_timeout_ring_set(void *data, u64 val) {
> >>> +	struct amdgpu_ring *ring = data;
> >>> +
> >>> +	if (val == 0)
> >>> +		ring->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> >>> +	else
> >>> +		ring->sched.timeout = msecs_to_jiffies(val);
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_timeout_ring_fops,
> >>> +			 amdgpu_debugfs_timeout_ring_get,
> >>> +			 amdgpu_debugfs_timeout_ring_set,
> >>> +			 "%llu\n");
> >>> +
> >>>  #endif
> >>>
> >>>  void amdgpu_debugfs_ring_init(struct amdgpu_device *adev,
> >>>  			      struct amdgpu_ring *ring)
> >>>  {
> >>>  #if defined(CONFIG_DEBUG_FS)
> >>>  	struct drm_minor *minor = adev_to_drm(adev)->primary;
> >>>  	struct dentry *root = minor->debugfs_root;
> >>> -	char name[32];
> >>> +	char name[40];
> >>>
> >>>  	sprintf(name, "amdgpu_ring_%s", ring->name);
> >>>  	debugfs_create_file_size(name, S_IFREG | S_IRUGO, root, ring,
> >>>  				 &amdgpu_debugfs_ring_fops,
> >>>  				 ring->ring_size + 12);
> >>>
> >>> +	sprintf(name, "amdgpu_timeout_ring_%s", ring->name);
> >>> +	debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR, root, ring,
> >>> +			    &amdgpu_debugfs_timeout_ring_fops);
> >>>  #endif
> >>>  }
> >>>
> >>>  /**
> >>>   * amdgpu_ring_test_helper - tests ring and set sched readiness status
> >>>   *
> >>>   * @ring: ring to try the recovery on
> >>>   *
> >>>   * Tests ring and set sched readiness status
> >>>   *
> >
>
--
Learn what the world really is like, but never forget how it should be.
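P.S.: Purely for illustration, here is a minimal userspace sketch of how
a shader-debugging tool could flip the proposed knob at runtime. The
debugfs mount point, the DRM minor number (0) and the ring name
("gfx_0.0.0") are assumptions on my side -- they vary per system and
ASIC, so a real tool would have to discover them first:

    /*
     * Hypothetical usage sketch (not part of the patch): disable the GPU
     * scheduler timeout for one ring via the proposed debugfs file so a
     * hung shader stays resident for inspection.
     */
    #include <stdio.h>

    int main(void)
    {
    	/* Assumed path; the actual minor and ring name differ per system. */
    	const char *path =
    		"/sys/kernel/debug/dri/0/amdgpu_timeout_ring_gfx_0.0.0";
    	FILE *f = fopen(path, "w");

    	if (!f) {
    		perror("fopen");
    		return 1;
    	}

    	/*
    	 * Per the patch above, writing 0 maps to MAX_SCHEDULE_TIMEOUT,
    	 * i.e. the scheduler never times out jobs on this ring; any
    	 * other value is taken as a timeout in milliseconds.
    	 */
    	fprintf(f, "0\n");
    	fclose(f);
    	return 0;
    }

Writing a non-zero value restores a finite timeout (in milliseconds), so
the same file also covers the "just give me a longer timeout while I'm
debugging" case, and reading it back reports the current setting.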