On 1/3/2024 12:58, Felix Kuehling wrote: > A segfault in Mesa seems to be a different issue from what's mentioned > in the commit message. I'd let Christian or Marek comment on > compatibility with graphics UMDs. I'm not sure why this patch would > affect them at all. I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related: [ 91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002 [ 91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0) > > Looking at the logs in the tickets, it looks like a fence reference > counting error. I don't see how Jay's patch could have caused that. I > made another change in that code recently that could make a difference > for this issue: > > commit 8f08c5b24ced1be7eb49692e4816c1916233c79b > Author: Felix Kuehling <Felix.Kuehling@xxxxxxx> > Date: Fri Oct 27 18:21:55 2023 -0400 > > drm/amdkfd: Run restore_workers on freezable WQs > > Make restore workers freezable so we don't have to explicitly > flush them > in suspend and GPU reset code paths, and we don't accidentally > try to > restore BOs while the GPU is suspended. Not having to flush > restore_work > also helps avoid lock/fence dependencies in the GPU reset case > where we're > not allowed to wait for fences. > > A side effect of this is, that we can now have multiple > concurrent threads > trying to signal the same eviction fence. Rework eviction fence > signaling > and replacement to account for that. > > The GPU reset path can no longer rely on restore_process_worker > to resume > queues because evict/restore workers can run independently of > it. Instead > call a new restore_process_helper directly. > > This is an RFC and request for testing. > > v2: > - Reworked eviction fence signaling > - Introduced restore_process_helper > > v3: > - Handle unsignaled eviction fences in restore_process_bos > > Signed-off-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > Acked-by: Christian König <christian.koenig@xxxxxxx> > Tested-by: Emily Deng <Emily.Deng@xxxxxxx> > Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx> > > > FWIW, I built a plain 6.6 kernel, and was not able to reproduce the > crash with some simple tests. > > Regards, > Felix > > >> >> So I agree, let's revert it. >> >> Reviewed-by: Jay Cornwall <jay.cornwall@xxxxxxx>