Re: [PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"

Marek Olšák <maraeo@xxxxxxxxx> · Thu, 4 Jan 2024 13:29:30 -0500

Hi,

I have received information that the original commit makes all 32-bit
userspace VA allocations fail, so UMDs like Mesa can't even initialize
and they either crash or fail to load. If TBA/TMA was relocated to the
32-bit address range, it would explain why UMDs can't allocate
anything in that range.

Marek

On Wed, Jan 3, 2024 at 2:50 PM Jay Cornwall <jay.cornwall@xxxxxxx> wrote:
>
> On 1/3/2024 12:58, Felix Kuehling wrote:
>
> > A segfault in Mesa seems to be a different issue from what's mentioned
> > in the commit message. I'd let Christian or Marek comment on
> > compatibility with graphics UMDs. I'm not sure why this patch would
> > affect them at all.
>
> I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related:
>
> [   91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
> [   91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)
>
> >
> > Looking at the logs in the tickets, it looks like a fence reference
> > counting error. I don't see how Jay's patch could have caused that. I
> > made another change in that code recently that could make a difference
> > for this issue:
> >
> >     commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
> >     Author: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> >     Date:   Fri Oct 27 18:21:55 2023 -0400
> >
> >          drm/amdkfd: Run restore_workers on freezable WQs
> >
> >          Make restore workers freezable so we don't have to explicitly
> >     flush them
> >          in suspend and GPU reset code paths, and we don't accidentally
> >     try to
> >          restore BOs while the GPU is suspended. Not having to flush
> >     restore_work
> >          also helps avoid lock/fence dependencies in the GPU reset case
> >     where we're
> >          not allowed to wait for fences.
> >
> >          A side effect of this is, that we can now have multiple
> >     concurrent threads
> >          trying to signal the same eviction fence. Rework eviction fence
> >     signaling
> >          and replacement to account for that.
> >
> >          The GPU reset path can no longer rely on restore_process_worker
> >     to resume
> >          queues because evict/restore workers can run independently of
> >     it. Instead
> >          call a new restore_process_helper directly.
> >
> >          This is an RFC and request for testing.
> >
> >          v2:
> >          - Reworked eviction fence signaling
> >          - Introduced restore_process_helper
> >
> >          v3:
> >          - Handle unsignaled eviction fences in restore_process_bos
> >
> >          Signed-off-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> >          Acked-by: Christian König <christian.koenig@xxxxxxx>
> >          Tested-by: Emily Deng <Emily.Deng@xxxxxxx>
> >          Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> >
> >
> > FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
> > crash with some simple tests.
> >
> > Regards,
> >    Felix
> >
> >
> >>
> >> So I agree, let's revert it.
> >>
> >> Reviewed-by: Jay Cornwall <jay.cornwall@xxxxxxx>
>