Hi, I have received information that the original commit makes all 32-bit userspace VA allocations fail, so UMDs like Mesa can't even initialize and they either crash or fail to load. If TBA/TMA was relocated to the 32-bit address range, it would explain why UMDs can't allocate anything in that range. Marek On Wed, Jan 3, 2024 at 2:50 PM Jay Cornwall <jay.cornwall@xxxxxxx> wrote: > > On 1/3/2024 12:58, Felix Kuehling wrote: > > > A segfault in Mesa seems to be a different issue from what's mentioned > > in the commit message. I'd let Christian or Marek comment on > > compatibility with graphics UMDs. I'm not sure why this patch would > > affect them at all. > > I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related: > > [ 91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002 > [ 91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0) > > > > > Looking at the logs in the tickets, it looks like a fence reference > > counting error. I don't see how Jay's patch could have caused that. I > > made another change in that code recently that could make a difference > > for this issue: > > > > commit 8f08c5b24ced1be7eb49692e4816c1916233c79b > > Author: Felix Kuehling <Felix.Kuehling@xxxxxxx> > > Date: Fri Oct 27 18:21:55 2023 -0400 > > > > drm/amdkfd: Run restore_workers on freezable WQs > > > > Make restore workers freezable so we don't have to explicitly > > flush them > > in suspend and GPU reset code paths, and we don't accidentally > > try to > > restore BOs while the GPU is suspended. Not having to flush > > restore_work > > also helps avoid lock/fence dependencies in the GPU reset case > > where we're > > not allowed to wait for fences. > > > > A side effect of this is, that we can now have multiple > > concurrent threads > > trying to signal the same eviction fence. Rework eviction fence > > signaling > > and replacement to account for that. > > > > The GPU reset path can no longer rely on restore_process_worker > > to resume > > queues because evict/restore workers can run independently of > > it. Instead > > call a new restore_process_helper directly. > > > > This is an RFC and request for testing. > > > > v2: > > - Reworked eviction fence signaling > > - Introduced restore_process_helper > > > > v3: > > - Handle unsignaled eviction fences in restore_process_bos > > > > Signed-off-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > > Acked-by: Christian König <christian.koenig@xxxxxxx> > > Tested-by: Emily Deng <Emily.Deng@xxxxxxx> > > Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx> > > > > > > FWIW, I built a plain 6.6 kernel, and was not able to reproduce the > > crash with some simple tests. > > > > Regards, > > Felix > > > > > >> > >> So I agree, let's revert it. > >> > >> Reviewed-by: Jay Cornwall <jay.cornwall@xxxxxxx> >