Re: [PATCH v6 2/8] drm/ttm: Add ttm_bo_access

Matthew Brost <matthew.brost@xxxxxxxxx> · Mon, 11 Nov 2024 14:45:51 -0800

On Mon, Nov 11, 2024 at 04:54:57PM +0100, Christian König wrote:
> Am 11.11.24 um 15:00 schrieb Joonas Lahtinen:
> > Quoting Christian König (2024-11-11 13:34:12)
> > > Am 11.11.24 um 11:10 schrieb Simona Vetter:
> > > > On Mon, Nov 11, 2024 at 10:00:17AM +0200, Joonas Lahtinen wrote:
> > > > > Back from some time off and will try to answer below.
> > > > > 
> > > > > Adding Dave and Sima as this topic has been previously discussed to some
> > > > > extent and will be good to reach common understanding about what the
> > > > > series is trying to do and what is the difference to the AMD debugging
> > > > > model.
> > > > I chatted about this thread a bit on irc with folks, and I think an
> > > > orthogonal issue is the question, what should be in ttm-utils? I've asked
> > > > Matt to type up a DOC patch once we have some consensus, since imo the
> > > > somewhat lackluster documentation situation for ttm is also somewhat a
> > > > cause for these big threads on various different topics. Aside from the
> > > > fact that gpu memory management is just hard.
> > > > 
> > > > On the uapi/design aspect, I think this would serve well with a patch to
> > > > drm-uapi.rst that adds a debugging section? At least once we have some
> > > > rough consensus across drivers, and more importantly userspace in the form
> > > > of gdb upstream (at least I'm not aware of any other upstream debugger
> > > > patches, I think amd's rocm stuff is also gdb-only).
> > > Yeah that seems to be a really good idea. Similar design ideas came up
> > > AMD internally as well but where dropped after pointing people to
> > > pidfd_getfd().
> > > 
> > > But the bigger problem seems to be that the design doesn't seems to take
> > > the dma_fence requirements into account.
> > Where would you deduce that?
> 
> XE is based on a preemption fence based memory management.
> 
> > We specifically limit the debugging to Long Running contexts which don't
> > depend on dma_fences.
> 
> That doesn't matter.
> 
> As long as you don't have a page fault (HMM) based memory management you
> still have that inter dependency and at least the public available XE code
> doesn't seem to have that.
> 
> > > In other words attaching gdb to a pid seems to stop the GPU thread of
> > > this pid without waiting for the XE preemption nor end of operation fence.
> > > 
> > > I mean if the GPU threads are preempted that could work, but yeah not
> > > like this :)
> > For us, hitting a breakpoint inside the workload would always violate
> > any dma_fence timeout for the submitted workload, as the HW context can't
> > be switched out while in the breakpoint.
> 
> That is clearly *not* something you can do without changing your memory
> management.
> 
> > For any dma_fence workload the guarantee is that that it completes
> > within reasonable time after submission (guaranteed by the submitter). I
> > don't see how you could really allow interactive debugging of a
> > breakpoint under those restrictions anyway even if pre-emption was
> > supported as the workload would not finish in <10 seconds?
> 
> Yeah that is the whole point, this is impossible as far as we know.
> 
> > For i915 we did have the "pre-emptable but indefinitely long dma_fence workloads"
> > concept at one point and that was rejected after the lengthy discussion.
> 
> I mean I tried that more than once myself and we have multiple requests for
> this on the AMD side from customers. So far nobody came up with a solution
> which actually works correctly.
> 
> > So I think only way to allow interactive debugging is to avoid the
> > dma_fences. Curious to hear if there are ideas for otherwise.
> 
> You need to guarantee somehow that the process is taken from the hardware so
> that the preemption fence can signal.
> 

Our preemption fences have this functionality.

A preemption fence issues a suspend execution command to the firmware. The
firmware, in turn, attempts to preempt the workload. If it doesn't respond
within a specified period, it resets the hardware queue, sends a message to KMD,
bans the software queue, and signals the preemption fence.

We provide even more protection than that. If, for some reason, the firmware
doesn't respond within a longer timeout period, the KMD performs a device reset,
ban the offending software queue(s), and will signal the preemption fences.

This flow remains the same whether a debugger is attached or, for example, a
user submits a 10-minute non-preemptable workload. In either case, other
processes are guaranteed to make forward progress.

The example above illustrates the memory oversubscription case, where two
processes are using 51% of the memory.

Another preemption scenario involves two processes sharing hardware resources.
Our firmware follows the same flow here. If an LR workload is using a hardware
resource and a DMA-fence workload is waiting, and if the LR workload doesn't
preempt the in a timely manner, the firmware issues a hardware reset, notifies
KMD, and bans the LR software queue. The DMA-fence workload then can make
forward progress

With the above in mind, this is why I say that if a user tries to run a game and
a non-preemptable LR workload, either oversubscribing memory or sharing hardware
resources, it is unlikely to work well. However, I don't think this is a common
use case. I would expect that when a debugger is open, it is typically by a
power user who knows how to disable other GPU tasks (e.g., by enabling software
rendering or using a machine without any display).

Given this, please to reconsider your position.

> This means that a breakpoint or core dump doesn't halt GPU threads, but
> rather suspends them. E.g. all running wave data is collected into a state
> bag which can be restored later on.
> 
> I was under the impression that those long running compute threads do
> exactly that, but when the hardware can't switch out the GPU thread/process
> while in a break then that isn't the case.
> 
> As long as you don't find a way to avoid that this patch set is a pretty
> clear NAK from my side as DMA-buf and TTM maintainer.
> 

I believe this is addressed above.

Matt

> What might work is to keep the submission on the hardware in the break state
> but forbid any memory access. This way you can signal your preemption fence
> even when the hardware isn't made available.
> 
> Before you continue XE setups a new pre-emption fence and makes sure that
> all page tables etc... are up to date.
> 
> Could be tricky to get this right if completion fence based submissions are
> mixed in as well, but that gives you at least a direction you could
> potentially go.
> 
> Regards,
> Christian.
> 
> > 
> > Regards, Joonas
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Some wash-up thoughts from me below, but consider them fairly irrelevant
> > > > since I think the main driver for these big questions here should be
> > > > gdb/userspace.
> > > > 
> > > > > Quoting Christian König (2024-11-07 11:44:33)
> > > > > > Am 06.11.24 um 18:00 schrieb Matthew Brost:
> > > > > > 
> > > > > >       [SNIP]
> > > > > > 
> > > > > >       This is not a generic interface that anyone can freely access. The same
> > > > > >       permissions used by ptrace are checked when opening such an interface.
> > > > > >       See [1] [2].
> > > > > > 
> > > > > >       [1]https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
> > > > > >       [2]https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
> > > > > > 
> > > > > > 
> > > > > > Thanks a lot for those pointers, that is exactly what I was looking for.
> > > > > > 
> > > > > > And yeah, it is what I feared. You are re-implementing existing functionality,
> > > > > > but see below.
> > > > > Could you elaborate on what this "existing functionality" exactly is?
> > > > > I do not think this functionality exists at this time.
> > > > > 
> > > > > The EU debugging architecture for Xe specifically avoids the need for GDB
> > > > > to attach with ptrace to the CPU process or interfere with the CPU process for
> > > > > the debugging via parasitic threads or so.
> > > > > 
> > > > > Debugger connection is opened to the DRM driver for given PID (which uses the
> > > > > ptrace may access check for now) after which the all DRM client of that
> > > > > PID are exposed to the debugger process.
> > > > > 
> > > > > What we want to expose via that debugger connection is the ability for GDB to
> > > > > read/write the different GPU VM address spaces (ppGTT for Intel GPUs) just like
> > > > > the EU threads would see them. Note that the layout of the ppGTT is
> > > > > completely up to the userspace driver to setup and is mostly only partially
> > > > > equal to the CPU address space.
> > > > > 
> > > > > Specifically as part of reading/writing the ppGTT for debugging purposes,
> > > > > there are deep flushes needed: for example flushing instruction cache
> > > > > when adding/removing breakpoints.
> > > > > 
> > > > > Maybe that will explain the background. I elaborate on this at the end some more.
> > > > > 
> > > > > >               kmap/vmap are used everywhere in the DRM subsystem to access BOs, so I’m
> > > > > >               failing to see the problem with adding a simple helper based on existing
> > > > > >               code.
> > > > > > 
> > > > > >           What#s possible and often done is to do kmap/vmap if you need to implement a
> > > > > >           CPU copy for scanout for example or for copying/validating command buffers.
> > > > > >           But that usually requires accessing the whole BO and has separate security
> > > > > >           checks.
> > > > > > 
> > > > > >           When you want to access only a few bytes of a BO that sounds massively like
> > > > > >           a peek/poke like interface and we have already rejected that more than once.
> > > > > >           There even used to be standardized GEM IOCTLs for that which have been
> > > > > >           removed by now.
> > > > > Referring to the explanation at top: These IOCTL are not for the debugging target
> > > > > process to issue. The peek/poke interface is specifically for GDB only
> > > > > to facilitate the emulation of memory reads/writes on the GPU address
> > > > > space as they were done by EUs themselves. And to recap: for modifying
> > > > > instructions for example (add/remove breakpoint), extra level of cache flushing is
> > > > > needed which is not available to regular userspace.
> > > > > 
> > > > > I specifically discussed with Sima on the difference before moving forward with this
> > > > > design originally. If something has changed since then, I'm of course happy to rediscuss.
> > > > > 
> > > > > However, if this code can't be added, not sure how we would ever be able
> > > > > to implement core dumps for GPU threads/memory?
> > > > > 
> > > > > >           If you need to access BOs which are placed in not CPU accessible memory then
> > > > > >           implement the access callback for ptrace, see amdgpu_ttm_access_memory for
> > > > > >           an example how to do this.
> > > > > As also mentioned above, we don't work via ptrace at all when it comes
> > > > > to debugging the EUs. The only thing used for now is the ptrace_may_access to
> > > > > implement similar access restrictions as ptrace has. This can be changed
> > > > > to something else if needed.
> > > > > 
> > > > > >       Ptrace access via vm_operations_struct.access → ttm_bo_vm_access.
> > > > > > 
> > > > > >       This series renames ttm_bo_vm_access to ttm_bo_access, with no code changes.
> > > > > > 
> > > > > >       The above function accesses a BO via kmap if it is in SYSTEM / TT,
> > > > > >       which is existing code.
> > > > > > 
> > > > > >       This function is only exposed to user space via ptrace permissions.
> > > > > Maybe this sentence is what caused the confusion.
> > > > > 
> > > > > Userspace is never exposed with peek/poke interface, only the debugger
> > > > > connection which is its own FD.
> > > > > 
> > > > > >       In this series, we implement a function [3] similar to
> > > > > >       amdgpu_ttm_access_memory for the TTM vfunc access_memory. What is
> > > > > >       missing is non-visible CPU memory access, similar to
> > > > > >       amdgpu_ttm_access_memory_sdma. This will be addressed in a follow-up and
> > > > > >       was omitted in this series given its complexity.
> > > > > > 
> > > > > >       So, this looks more or less identical to AMD's ptrace implementation,
> > > > > >       but in GPU address space. Again, I fail to see what the problem is here.
> > > > > >       What am I missing?
> > > > > > 
> > > > > > 
> > > > > > The main question is why can't you use the existing interfaces directly?
> > > > > We're not working on the CPU address space or BOs. We're working
> > > > > strictly on the GPU address space as would be seen by an EU thread if it
> > > > > accessed address X.
> > > > > 
> > > > > > Additional to the peek/poke interface of ptrace Linux has the pidfd_getfd
> > > > > > system call, see herehttps://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
> > > > > > 
> > > > > > The pidfd_getfd() allows to dup() the render node file descriptor into your gdb
> > > > > > process. That in turn gives you all the access you need from gdb, including
> > > > > > mapping BOs and command submission on behalf of the application.
> > > > > We're not operating on the CPU address space nor are we operating on BOs
> > > > > (there is no concept of BO in the EU debug interface). Each VMA in the VM
> > > > > could come from anywhere, only the start address and size matter. And
> > > > > neither do we need to interfere with the command submission of the
> > > > > process under debug.
> > > > > 
> > > > > > As far as I can see that allows for the same functionality as the eudebug
> > > > > > interface, just without any driver specific code messing with ptrace
> > > > > > permissions and peek/poke interfaces.
> > > > > > 
> > > > > > So the question is still why do you need the whole eudebug interface in the
> > > > > > first place? I might be missing something, but that seems to be superfluous
> > > > > > from a high level view.
> > > > > Recapping from above. It is to allow the debugging of EU threads per DRM
> > > > > client, completely independent of the CPU process. If ptrace_may_acces
> > > > > is the sore point, we could consider other permission checks, too. There
> > > > > is no other connection to ptrace in this architecture as single
> > > > > permission check to know if PID is fair game to access by debugger
> > > > > process.
> > > > > 
> > > > > Why no parasitic thread or ptrace: Going forward, binding the EU debugging to
> > > > > the DRM client would also pave way for being able to extend core kernel generated
> > > > > core dump with each DRM client's EU thread/memory dump. We have similar
> > > > > feature called "Offline core dump" enabled in the downstream public
> > > > > trees for i915, where we currently attach the EU thread dump to i915 error state
> > > > > and then later combine i915 error state with CPU core dump file with a
> > > > > tool.
> > > > > 
> > > > > This is relatively little amount of extra code, as this baseline series
> > > > > already introduces GDB the ability to perform the necessary actions.
> > > > > It's just the matter of kernel driver calling: "stop all threads", then
> > > > > copying the memory map and memory contents for GPU threads, just like is
> > > > > done for CPU threads.
> > > > > 
> > > > > With parasitic thread injection, not sure if there is such way forward,
> > > > > as it would seem to require to inject quite abit more logic to core kernel?
> > > > > 
> > > > > > It's true that the AMD KFD part has still similar functionality, but that is
> > > > > > because of the broken KFD design of tying driver state to the CPU process
> > > > > > (which makes it inaccessible for gdb even with imported render node fd).
> > > > > > 
> > > > > > Both Sima and I (and partially Dave as well) have pushed back on the KFD
> > > > > > approach. And the long term plan is to get rid of such device driver specific
> > > > > > interface which re-implement existing functionality just differently.
> > > > > Recapping, this series is not adding it back. The debugger connection
> > > > > is a separate FD from the DRM one, with separate IOCTL set. We don't allow
> > > > > the DRM FD any new operations based on ptrace is attached or not. We
> > > > > don't ever do that check even.
> > > > > 
> > > > > We only restrict the opening of the debugger connection to given PID with
> > > > > ptrace_may_access check for now. That can be changed to something else,
> > > > > if necessary.
> > > > Yeah I think unnecessarily tying gpu processes to cpu processes is a bad
> > > > thing, least because even today all the svm discussions we have still hit
> > > > clear use-cases, where a 1:1 match is not wanted (like multiple gpu svm
> > > > sections with offsets). Not even speaking of all the gpu usecases where
> > > > the gpu vm space is still entirely independent of the cpu side.
> > > > 
> > > > So that's why I think this entirely separate approach looks like the right
> > > > one, with ptrace_may_access as the access control check to make sure we
> > > > match ptrace on the cpu side.
> > > > 
> > > > But there's very obviously a bikeshed to be had on what the actual uapi
> > > > should look like, especially how gdb opens up a gpu debug access fd. But I
> > > > also think that's not much on drm to decide, but whatever gdb wants. And
> > > > then we aim for some consistency on that lookup/access control part
> > > > (ideally, I might be missing some reasons why this is a bad idea) across
> > > > drm drivers.
> > > > 
> > > > > > So you need to have a really really good explanation why the eudebug interface
> > > > > > is actually necessary.
> > > > > TL;DR The main point is to decouple the debugging of the EU workloads from the
> > > > > debugging of the CPU process. This avoids the interference with the CPU process with
> > > > > parasitic thread injection. Further this also allows generating a core dump
> > > > > without any GDB connected. There are also many other smaller pros/cons
> > > > > which can be discussed but for the context of this patch, this is the
> > > > > main one.
> > > > > 
> > > > > So unlike parasitic thread injection, we don't unlock any special IOCTL for
> > > > > the process under debug to be performed by the parasitic thread, but we
> > > > > allow the minimal set of operations to be performed by GDB as if those were
> > > > > done on the EUs themselves.
> > > > > 
> > > > > One can think of it like the minimal subset of ptrace but for EU threads,
> > > > > not the CPU threads. And thus, building on this it's possible to extend
> > > > > the core kernel generated core dumps with DRM specific extension which
> > > > > would contain the EU thread/memory dump.
> > > > It might be good to document (in that debugging doc patch probably) why
> > > > thread injection is not a great option, and why the tradeoffs for
> > > > debugging are different than for for checkpoint/restore, where with CRIU
> > > > we landed on doing most of this in userspace, and often requiring
> > > > injection threads to make it all work.
> > > > 
> > > > Cheers, Sima
> > > > 
> > > > > Regards, Joonas
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >       Matt
> > > > > > 
> > > > > >       [3]https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
> > > > > > 
> > > > > > 
> > > > > >           Regards,
> > > > > >           Christian.
> > > > > > 
> > > > > > 
> > > > > >               Matt
> > > > > > 
> > > > > > 
> > > > > >                   Regards,
> > > > > >                   Christian.
> > > > > >