Re: [PATCH v6 2/8] drm/ttm: Add ttm_bo_access

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Tue, 12 Nov 2024 17:22:07 +0100

On Tue, 2024-11-12 at 15:41 +0200, Joonas Lahtinen wrote:
> (+ Thomas)
> 
> Quoting Christian König (2024-11-12 11:23:36)
> > Am 11.11.24 um 23:45 schrieb Matthew Brost:
> > 
> >     [SNIP]
> > 
> >             So I think only way to allow interactive debugging is
> > to avoid the
> >             dma_fences. Curious to hear if there are ideas for
> > otherwise.
> > 
> >         You need to guarantee somehow that the process is taken
> > from the hardware so
> >         that the preemption fence can signal.
> > 
> > 
> >     Our preemption fences have this functionality.
> > 
> >     A preemption fence issues a suspend execution command to the
> > firmware. The
> >     firmware, in turn, attempts to preempt the workload. If it
> > doesn't respond
> >     within a specified period, it resets the hardware queue, sends
> > a message to KMD,
> >     bans the software queue, and signals the preemption fence.
> > 
> >     We provide even more protection than that. If, for some reason,
> > the firmware
> >     doesn't respond within a longer timeout period, the KMD
> > performs a device reset,
> >     ban the offending software queue(s), and will signal the
> > preemption fences.
> > 
> >     This flow remains the same whether a debugger is attached or,
> > for example, a
> >     user submits a 10-minute non-preemptable workload. In either
> > case, other
> >     processes are guaranteed to make forward progress.
> > 
> > 
> > Yeah that is pretty much the same argumentation I have heard before
> > and it
> > turned out to not be working.
> > 
> > 
> >     The example above illustrates the memory oversubscription case,
> > where two
> >     processes are using 51% of the memory.
> > 
> > 
> > That isn't even necessary. We have seen applications dying just
> > because the
> > core memory management tried to join back small pages into huge
> > pages in an
> > userptr.
> > 
> > That the core memory management jumps in and requests that the pre-
> > emption
> > fence signals can happen all the time.
> 
> Ouch. Does there happen to be a known reproducer for this behavior or
> maybe
> bug report?
> 
> > You can mitigate that a bit, Fedora for example disables joining
> > back small
> > pages into huge pages by default for example and we even had people
> > suggesting
> > to use mprotect() so that userptrs VMAs don't fork() any more
> > (which is of
> > course completely illegal).
> > 
> > But my long term take away is that you can't block all causes of
> > sudden
> > requests to let a pre-emption fence signal.
> 
> I think this problem equally applies to the LR-workloads like the EU
> debugging ones.
> 
> >     Another preemption scenario involves two processes sharing
> > hardware resources.
> >     Our firmware follows the same flow here. If an LR workload is
> > using a hardware
> >     resource and a DMA-fence workload is waiting, and if the LR
> > workload doesn't
> >     preempt the in a timely manner, the firmware issues a hardware
> > reset, notifies
> >     KMD, and bans the LR software queue. The DMA-fence workload
> > then can make
> >     forward progress
> > 
> >     With the above in mind, this is why I say that if a user tries
> > to run a game and
> >     a non-preemptable LR workload, either oversubscribing memory or
> > sharing hardware
> >     resources, it is unlikely to work well. However, I don't think
> > this is a common
> >     use case. I would expect that when a debugger is open, it is
> > typically by a
> >     power user who knows how to disable other GPU tasks (e.g., by
> > enabling software
> >     rendering or using a machine without any display).
> > 
> >     Given this, please to reconsider your position.
> > 
> > 
> > The key point here is that this isn't stable, you can do that as a
> > tech demo
> > but it can always be that debugging an application just randomly
> > dies. And
> > believe me AMD has tried this to a rather extreme extend as well.
> 
> It's not really only limited to the debuggable applications at all,
> the
> normal LR workloads are equally impacted as far as I understand. Just
> harder to catch the issue with LR-workloads if the pre-emption fence
> signaling is sporadic.
> 
> > What you could potentially work is to taint the kernel and make
> > sure that this
> > function is only available to user who absolutely know what they
> > are doing.
> > 
> > But I would say we can only allow that if all other options have
> > been exercised
> > and doing it like this is really the only option left.
> 
> It sounds like servicing the memory pre-empt fence by stealing the
> pages from underneath the workload would be the way to resolve this
> issue.
> 
> This has been extensively discussed already, but was expected to
> really
> only be needed for low-on-memory scenarios. However it now seems like
> the need is much earlier due to the random userptr page joining by
> core
> mm.

Just to clarify here:

In Long-Running mode with recoverable pagefaults enabled we don't have
any preempt-fences, but rather just zap the PTEs pointing to the
affected memory and flush TLB. So from a memory resource POW a
breakpoint should be safe, and no mmu notifier nor shrinker will be
blocked.

Nor will there be any jobs with published dma-fences depending on the
job blocked either temporarily by a pagefault or long-term by a
debugger breakpoint.

/Thomas

> 
> If that is done and the memory pre-empt fence is serviced even for
> debuggable contexts, do you have further concerns with the presented
> approach
> from dma-buf and drm/sched perspective?
> 
> Regards, Joonas
> 
> > 
> > Regards,
> > Christian.
> > 
> > 
> >         This means that a breakpoint or core dump doesn't halt GPU
> > threads, but
> >         rather suspends them. E.g. all running wave data is
> > collected into a state
> >         bag which can be restored later on.
> > 
> >         I was under the impression that those long running compute
> > threads do
> >         exactly that, but when the hardware can't switch out the
> > GPU thread/process
> >         while in a break then that isn't the case.
> > 
> >         As long as you don't find a way to avoid that this patch
> > set is a pretty
> >         clear NAK from my side as DMA-buf and TTM maintainer.
> > 
> > 
> >     I believe this is addressed above.
> > 
> >     Matt
> > 
> > 
> >         What might work is to keep the submission on the hardware
> > in the break state
> >         but forbid any memory access. This way you can signal your
> > preemption fence
> >         even when the hardware isn't made available.
> > 
> >         Before you continue XE setups a new pre-emption fence and
> > makes sure that
> >         all page tables etc... are up to date.
> > 
> >         Could be tricky to get this right if completion fence based
> > submissions are
> >         mixed in as well, but that gives you at least a direction
> > you could
> >         potentially go.
> > 
> >         Regards,
> >         Christian.
> > 
> > 
> >             Regards, Joonas
> > 
> > 
> >                 Regards,
> >                 Christian.
> > 
> > 
> >                     Some wash-up thoughts from me below, but
> > consider them fairly irrelevant
> >                     since I think the main driver for these big
> > questions here should be
> >                     gdb/userspace.
> > 
> > 
> >                         Quoting Christian König (2024-11-07
> > 11:44:33)
> > 
> >                             Am 06.11.24 um 18:00 schrieb Matthew
> > Brost:
> > 
> >                                   [SNIP]
> > 
> >                                   This is not a generic interface
> > that anyone can freely access. The same
> >                                   permissions used by ptrace are
> > checked when opening such an interface.
> >                                   See [1] [2].
> > 
> >                                  
> > [1]https://patchwork.freedesktop.org/patch/617470/?series=136572&re
> > v=2
> >                                  
> > [2]https://patchwork.freedesktop.org/patch/617471/?series=136572&re
> > v=2
> > 
> > 
> >                             Thanks a lot for those pointers, that
> > is exactly what I was looking for.
> > 
> >                             And yeah, it is what I feared. You are
> > re-implementing existing functionality,
> >                             but see below.
> > 
> >                         Could you elaborate on what this "existing
> > functionality" exactly is?
> >                         I do not think this functionality exists at
> > this time.
> > 
> >                         The EU debugging architecture for Xe
> > specifically avoids the need for GDB
> >                         to attach with ptrace to the CPU process or
> > interfere with the CPU process for
> >                         the debugging via parasitic threads or so.
> > 
> >                         Debugger connection is opened to the DRM
> > driver for given PID (which uses the
> >                         ptrace may access check for now) after
> > which the all DRM client of that
> >                         PID are exposed to the debugger process.
> > 
> >                         What we want to expose via that debugger
> > connection is the ability for GDB to
> >                         read/write the different GPU VM address
> > spaces (ppGTT for Intel GPUs) just like
> >                         the EU threads would see them. Note that
> > the layout of the ppGTT is
> >                         completely up to the userspace driver to
> > setup and is mostly only partially
> >                         equal to the CPU address space.
> > 
> >                         Specifically as part of reading/writing the
> > ppGTT for debugging purposes,
> >                         there are deep flushes needed: for example
> > flushing instruction cache
> >                         when adding/removing breakpoints.
> > 
> >                         Maybe that will explain the background. I
> > elaborate on this at the end some more.
> > 
> > 
> >                                           kmap/vmap are used
> > everywhere in the DRM subsystem to access BOs, so I’m
> >                                           failing to see the
> > problem with adding a simple helper based on existing
> >                                           code.
> > 
> >                                       What#s possible and often
> > done is to do kmap/vmap if you need to implement a
> >                                       CPU copy for scanout for
> > example or for copying/validating command buffers.
> >                                       But that usually requires
> > accessing the whole BO and has separate security
> >                                       checks.
> > 
> >                                       When you want to access only
> > a few bytes of a BO that sounds massively like
> >                                       a peek/poke like interface
> > and we have already rejected that more than once.
> >                                       There even used to be
> > standardized GEM IOCTLs for that which have been
> >                                       removed by now.
> > 
> >                         Referring to the explanation at top: These
> > IOCTL are not for the debugging target
> >                         process to issue. The peek/poke interface
> > is specifically for GDB only
> >                         to facilitate the emulation of memory
> > reads/writes on the GPU address
> >                         space as they were done by EUs themselves.
> > And to recap: for modifying
> >                         instructions for example (add/remove
> > breakpoint), extra level of cache flushing is
> >                         needed which is not available to regular
> > userspace.
> > 
> >                         I specifically discussed with Sima on the
> > difference before moving forward with this
> >                         design originally. If something has changed
> > since then, I'm of course happy to rediscuss.
> > 
> >                         However, if this code can't be added, not
> > sure how we would ever be able
> >                         to implement core dumps for GPU
> > threads/memory?
> > 
> > 
> >                                       If you need to access BOs
> > which are placed in not CPU accessible memory then
> >                                       implement the access callback
> > for ptrace, see amdgpu_ttm_access_memory for
> >                                       an example how to do this.
> > 
> >                         As also mentioned above, we don't work via
> > ptrace at all when it comes
> >                         to debugging the EUs. The only thing used
> > for now is the ptrace_may_access to
> >                         implement similar access restrictions as
> > ptrace has. This can be changed
> >                         to something else if needed.
> > 
> > 
> >                                   Ptrace access via
> > vm_operations_struct.access → ttm_bo_vm_access.
> > 
> >                                   This series renames
> > ttm_bo_vm_access to ttm_bo_access, with no code changes.
> > 
> >                                   The above function accesses a BO
> > via kmap if it is in SYSTEM / TT,
> >                                   which is existing code.
> > 
> >                                   This function is only exposed to
> > user space via ptrace permissions.
> > 
> >                         Maybe this sentence is what caused the
> > confusion.
> > 
> >                         Userspace is never exposed with peek/poke
> > interface, only the debugger
> >                         connection which is its own FD.
> > 
> > 
> >                                   In this series, we implement a
> > function [3] similar to
> >                                   amdgpu_ttm_access_memory for the
> > TTM vfunc access_memory. What is
> >                                   missing is non-visible CPU memory
> > access, similar to
> >                                   amdgpu_ttm_access_memory_sdma.
> > This will be addressed in a follow-up and
> >                                   was omitted in this series given
> > its complexity.
> > 
> >                                   So, this looks more or less
> > identical to AMD's ptrace implementation,
> >                                   but in GPU address space. Again,
> > I fail to see what the problem is here.
> >                                   What am I missing?
> > 
> > 
> >                             The main question is why can't you use
> > the existing interfaces directly?
> > 
> >                         We're not working on the CPU address space
> > or BOs. We're working
> >                         strictly on the GPU address space as would
> > be seen by an EU thread if it
> >                         accessed address X.
> > 
> > 
> >                             Additional to the peek/poke interface
> > of ptrace Linux has the pidfd_getfd
> >                             system call, see
> > herehttps://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
> > 
> >                             The pidfd_getfd() allows to dup() the
> > render node file descriptor into your gdb
> >                             process. That in turn gives you all the
> > access you need from gdb, including
> >                             mapping BOs and command submission on
> > behalf of the application.
> > 
> >                         We're not operating on the CPU address
> > space nor are we operating on BOs
> >                         (there is no concept of BO in the EU debug
> > interface). Each VMA in the VM
> >                         could come from anywhere, only the start
> > address and size matter. And
> >                         neither do we need to interfere with the
> > command submission of the
> >                         process under debug.
> > 
> > 
> >                             As far as I can see that allows for the
> > same functionality as the eudebug
> >                             interface, just without any driver
> > specific code messing with ptrace
> >                             permissions and peek/poke interfaces.
> > 
> >                             So the question is still why do you
> > need the whole eudebug interface in the
> >                             first place? I might be missing
> > something, but that seems to be superfluous
> >                             from a high level view.
> > 
> >                         Recapping from above. It is to allow the
> > debugging of EU threads per DRM
> >                         client, completely independent of the CPU
> > process. If ptrace_may_acces
> >                         is the sore point, we could consider other
> > permission checks, too. There
> >                         is no other connection to ptrace in this
> > architecture as single
> >                         permission check to know if PID is fair
> > game to access by debugger
> >                         process.
> > 
> >                         Why no parasitic thread or ptrace: Going
> > forward, binding the EU debugging to
> >                         the DRM client would also pave way for
> > being able to extend core kernel generated
> >                         core dump with each DRM client's EU
> > thread/memory dump. We have similar
> >                         feature called "Offline core dump" enabled
> > in the downstream public
> >                         trees for i915, where we currently attach
> > the EU thread dump to i915 error state
> >                         and then later combine i915 error state
> > with CPU core dump file with a
> >                         tool.
> > 
> >                         This is relatively little amount of extra
> > code, as this baseline series
> >                         already introduces GDB the ability to
> > perform the necessary actions.
> >                         It's just the matter of kernel driver
> > calling: "stop all threads", then
> >                         copying the memory map and memory contents
> > for GPU threads, just like is
> >                         done for CPU threads.
> > 
> >                         With parasitic thread injection, not sure
> > if there is such way forward,
> >                         as it would seem to require to inject quite
> > abit more logic to core kernel?
> > 
> > 
> >                             It's true that the AMD KFD part has
> > still similar functionality, but that is
> >                             because of the broken KFD design of
> > tying driver state to the CPU process
> >                             (which makes it inaccessible for gdb
> > even with imported render node fd).
> > 
> >                             Both Sima and I (and partially Dave as
> > well) have pushed back on the KFD
> >                             approach. And the long term plan is to
> > get rid of such device driver specific
> >                             interface which re-implement existing
> > functionality just differently.
> > 
> >                         Recapping, this series is not adding it
> > back. The debugger connection
> >                         is a separate FD from the DRM one, with
> > separate IOCTL set. We don't allow
> >                         the DRM FD any new operations based on
> > ptrace is attached or not. We
> >                         don't ever do that check even.
> > 
> >                         We only restrict the opening of the
> > debugger connection to given PID with
> >                         ptrace_may_access check for now. That can
> > be changed to something else,
> >                         if necessary.
> > 
> >                     Yeah I think unnecessarily tying gpu processes
> > to cpu processes is a bad
> >                     thing, least because even today all the svm
> > discussions we have still hit
> >                     clear use-cases, where a 1:1 match is not
> > wanted (like multiple gpu svm
> >                     sections with offsets). Not even speaking of
> > all the gpu usecases where
> >                     the gpu vm space is still entirely independent
> > of the cpu side.
> > 
> >                     So that's why I think this entirely separate
> > approach looks like the right
> >                     one, with ptrace_may_access as the access
> > control check to make sure we
> >                     match ptrace on the cpu side.
> > 
> >                     But there's very obviously a bikeshed to be had
> > on what the actual uapi
> >                     should look like, especially how gdb opens up a
> > gpu debug access fd. But I
> >                     also think that's not much on drm to decide,
> > but whatever gdb wants. And
> >                     then we aim for some consistency on that
> > lookup/access control part
> >                     (ideally, I might be missing some reasons why
> > this is a bad idea) across
> >                     drm drivers.
> > 
> > 
> >                             So you need to have a really really
> > good explanation why the eudebug interface
> >                             is actually necessary.
> > 
> >                         TL;DR The main point is to decouple the
> > debugging of the EU workloads from the
> >                         debugging of the CPU process. This avoids
> > the interference with the CPU process with
> >                         parasitic thread injection. Further this
> > also allows generating a core dump
> >                         without any GDB connected. There are also
> > many other smaller pros/cons
> >                         which can be discussed but for the context
> > of this patch, this is the
> >                         main one.
> > 
> >                         So unlike parasitic thread injection, we
> > don't unlock any special IOCTL for
> >                         the process under debug to be performed by
> > the parasitic thread, but we
> >                         allow the minimal set of operations to be
> > performed by GDB as if those were
> >                         done on the EUs themselves.
> > 
> >                         One can think of it like the minimal subset
> > of ptrace but for EU threads,
> >                         not the CPU threads. And thus, building on
> > this it's possible to extend
> >                         the core kernel generated core dumps with
> > DRM specific extension which
> >                         would contain the EU thread/memory dump.
> > 
> >                     It might be good to document (in that debugging
> > doc patch probably) why
> >                     thread injection is not a great option, and why
> > the tradeoffs for
> >                     debugging are different than for for
> > checkpoint/restore, where with CRIU
> >                     we landed on doing most of this in userspace,
> > and often requiring
> >                     injection threads to make it all work.
> > 
> >                     Cheers, Sima
> > 
> > 
> >                         Regards, Joonas
> > 
> > 
> >                             Regards,
> >                             Christian.
> > 
> > 
> > 
> >                                   Matt
> > 
> >                                  
> > [3]https://patchwork.freedesktop.org/patch/622520/?series=140200&re
> > v=6
> > 
> > 
> >                                       Regards,
> >                                       Christian.
> > 
> > 
> >                                           Matt
> > 
> > 
> >                                               Regards,
> >                                               Christian.
> > 
> > 
> >