On Mon, Nov 11, 2024 at 04:54:57PM +0100, Christian König wrote: > Am 11.11.24 um 15:00 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-11-11 13:34:12) > > > Am 11.11.24 um 11:10 schrieb Simona Vetter: > > > > On Mon, Nov 11, 2024 at 10:00:17AM +0200, Joonas Lahtinen wrote: > > > > > Back from some time off and will try to answer below. > > > > > > > > > > Adding Dave and Sima as this topic has been previously discussed to some > > > > > extent and will be good to reach common understanding about what the > > > > > series is trying to do and what is the difference to the AMD debugging > > > > > model. > > > > I chatted about this thread a bit on irc with folks, and I think an > > > > orthogonal issue is the question, what should be in ttm-utils? I've asked > > > > Matt to type up a DOC patch once we have some consensus, since imo the > > > > somewhat lackluster documentation situation for ttm is also somewhat a > > > > cause for these big threads on various different topics. Aside from the > > > > fact that gpu memory management is just hard. > > > > > > > > On the uapi/design aspect, I think this would serve well with a patch to > > > > drm-uapi.rst that adds a debugging section? At least once we have some > > > > rough consensus across drivers, and more importantly userspace in the form > > > > of gdb upstream (at least I'm not aware of any other upstream debugger > > > > patches, I think amd's rocm stuff is also gdb-only). > > > Yeah that seems to be a really good idea. Similar design ideas came up > > > AMD internally as well but where dropped after pointing people to > > > pidfd_getfd(). > > > > > > But the bigger problem seems to be that the design doesn't seems to take > > > the dma_fence requirements into account. > > Where would you deduce that? > > XE is based on a preemption fence based memory management. > > > We specifically limit the debugging to Long Running contexts which don't > > depend on dma_fences. > > That doesn't matter. > > As long as you don't have a page fault (HMM) based memory management you > still have that inter dependency and at least the public available XE code > doesn't seem to have that. > > > > In other words attaching gdb to a pid seems to stop the GPU thread of > > > this pid without waiting for the XE preemption nor end of operation fence. > > > > > > I mean if the GPU threads are preempted that could work, but yeah not > > > like this :) > > For us, hitting a breakpoint inside the workload would always violate > > any dma_fence timeout for the submitted workload, as the HW context can't > > be switched out while in the breakpoint. > > That is clearly *not* something you can do without changing your memory > management. > > > For any dma_fence workload the guarantee is that that it completes > > within reasonable time after submission (guaranteed by the submitter). I > > don't see how you could really allow interactive debugging of a > > breakpoint under those restrictions anyway even if pre-emption was > > supported as the workload would not finish in <10 seconds? > > Yeah that is the whole point, this is impossible as far as we know. > > > For i915 we did have the "pre-emptable but indefinitely long dma_fence workloads" > > concept at one point and that was rejected after the lengthy discussion. > > I mean I tried that more than once myself and we have multiple requests for > this on the AMD side from customers. So far nobody came up with a solution > which actually works correctly. > > > So I think only way to allow interactive debugging is to avoid the > > dma_fences. Curious to hear if there are ideas for otherwise. > > You need to guarantee somehow that the process is taken from the hardware so > that the preemption fence can signal. > Our preemption fences have this functionality. A preemption fence issues a suspend execution command to the firmware. The firmware, in turn, attempts to preempt the workload. If it doesn't respond within a specified period, it resets the hardware queue, sends a message to KMD, bans the software queue, and signals the preemption fence. We provide even more protection than that. If, for some reason, the firmware doesn't respond within a longer timeout period, the KMD performs a device reset, ban the offending software queue(s), and will signal the preemption fences. This flow remains the same whether a debugger is attached or, for example, a user submits a 10-minute non-preemptable workload. In either case, other processes are guaranteed to make forward progress. The example above illustrates the memory oversubscription case, where two processes are using 51% of the memory. Another preemption scenario involves two processes sharing hardware resources. Our firmware follows the same flow here. If an LR workload is using a hardware resource and a DMA-fence workload is waiting, and if the LR workload doesn't preempt the in a timely manner, the firmware issues a hardware reset, notifies KMD, and bans the LR software queue. The DMA-fence workload then can make forward progress With the above in mind, this is why I say that if a user tries to run a game and a non-preemptable LR workload, either oversubscribing memory or sharing hardware resources, it is unlikely to work well. However, I don't think this is a common use case. I would expect that when a debugger is open, it is typically by a power user who knows how to disable other GPU tasks (e.g., by enabling software rendering or using a machine without any display). Given this, please to reconsider your position. > This means that a breakpoint or core dump doesn't halt GPU threads, but > rather suspends them. E.g. all running wave data is collected into a state > bag which can be restored later on. > > I was under the impression that those long running compute threads do > exactly that, but when the hardware can't switch out the GPU thread/process > while in a break then that isn't the case. > > As long as you don't find a way to avoid that this patch set is a pretty > clear NAK from my side as DMA-buf and TTM maintainer. > I believe this is addressed above. Matt > What might work is to keep the submission on the hardware in the break state > but forbid any memory access. This way you can signal your preemption fence > even when the hardware isn't made available. > > Before you continue XE setups a new pre-emption fence and makes sure that > all page tables etc... are up to date. > > Could be tricky to get this right if completion fence based submissions are > mixed in as well, but that gives you at least a direction you could > potentially go. > > Regards, > Christian. > > > > > Regards, Joonas > > > > > Regards, > > > Christian. > > > > > > > Some wash-up thoughts from me below, but consider them fairly irrelevant > > > > since I think the main driver for these big questions here should be > > > > gdb/userspace. > > > > > > > > > Quoting Christian König (2024-11-07 11:44:33) > > > > > > Am 06.11.24 um 18:00 schrieb Matthew Brost: > > > > > > > > > > > > [SNIP] > > > > > > > > > > > > This is not a generic interface that anyone can freely access. The same > > > > > > permissions used by ptrace are checked when opening such an interface. > > > > > > See [1] [2]. > > > > > > > > > > > > [1]https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2 > > > > > > [2]https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2 > > > > > > > > > > > > > > > > > > Thanks a lot for those pointers, that is exactly what I was looking for. > > > > > > > > > > > > And yeah, it is what I feared. You are re-implementing existing functionality, > > > > > > but see below. > > > > > Could you elaborate on what this "existing functionality" exactly is? > > > > > I do not think this functionality exists at this time. > > > > > > > > > > The EU debugging architecture for Xe specifically avoids the need for GDB > > > > > to attach with ptrace to the CPU process or interfere with the CPU process for > > > > > the debugging via parasitic threads or so. > > > > > > > > > > Debugger connection is opened to the DRM driver for given PID (which uses the > > > > > ptrace may access check for now) after which the all DRM client of that > > > > > PID are exposed to the debugger process. > > > > > > > > > > What we want to expose via that debugger connection is the ability for GDB to > > > > > read/write the different GPU VM address spaces (ppGTT for Intel GPUs) just like > > > > > the EU threads would see them. Note that the layout of the ppGTT is > > > > > completely up to the userspace driver to setup and is mostly only partially > > > > > equal to the CPU address space. > > > > > > > > > > Specifically as part of reading/writing the ppGTT for debugging purposes, > > > > > there are deep flushes needed: for example flushing instruction cache > > > > > when adding/removing breakpoints. > > > > > > > > > > Maybe that will explain the background. I elaborate on this at the end some more. > > > > > > > > > > > kmap/vmap are used everywhere in the DRM subsystem to access BOs, so I’m > > > > > > failing to see the problem with adding a simple helper based on existing > > > > > > code. > > > > > > > > > > > > What#s possible and often done is to do kmap/vmap if you need to implement a > > > > > > CPU copy for scanout for example or for copying/validating command buffers. > > > > > > But that usually requires accessing the whole BO and has separate security > > > > > > checks. > > > > > > > > > > > > When you want to access only a few bytes of a BO that sounds massively like > > > > > > a peek/poke like interface and we have already rejected that more than once. > > > > > > There even used to be standardized GEM IOCTLs for that which have been > > > > > > removed by now. > > > > > Referring to the explanation at top: These IOCTL are not for the debugging target > > > > > process to issue. The peek/poke interface is specifically for GDB only > > > > > to facilitate the emulation of memory reads/writes on the GPU address > > > > > space as they were done by EUs themselves. And to recap: for modifying > > > > > instructions for example (add/remove breakpoint), extra level of cache flushing is > > > > > needed which is not available to regular userspace. > > > > > > > > > > I specifically discussed with Sima on the difference before moving forward with this > > > > > design originally. If something has changed since then, I'm of course happy to rediscuss. > > > > > > > > > > However, if this code can't be added, not sure how we would ever be able > > > > > to implement core dumps for GPU threads/memory? > > > > > > > > > > > If you need to access BOs which are placed in not CPU accessible memory then > > > > > > implement the access callback for ptrace, see amdgpu_ttm_access_memory for > > > > > > an example how to do this. > > > > > As also mentioned above, we don't work via ptrace at all when it comes > > > > > to debugging the EUs. The only thing used for now is the ptrace_may_access to > > > > > implement similar access restrictions as ptrace has. This can be changed > > > > > to something else if needed. > > > > > > > > > > > Ptrace access via vm_operations_struct.access → ttm_bo_vm_access. > > > > > > > > > > > > This series renames ttm_bo_vm_access to ttm_bo_access, with no code changes. > > > > > > > > > > > > The above function accesses a BO via kmap if it is in SYSTEM / TT, > > > > > > which is existing code. > > > > > > > > > > > > This function is only exposed to user space via ptrace permissions. > > > > > Maybe this sentence is what caused the confusion. > > > > > > > > > > Userspace is never exposed with peek/poke interface, only the debugger > > > > > connection which is its own FD. > > > > > > > > > > > In this series, we implement a function [3] similar to > > > > > > amdgpu_ttm_access_memory for the TTM vfunc access_memory. What is > > > > > > missing is non-visible CPU memory access, similar to > > > > > > amdgpu_ttm_access_memory_sdma. This will be addressed in a follow-up and > > > > > > was omitted in this series given its complexity. > > > > > > > > > > > > So, this looks more or less identical to AMD's ptrace implementation, > > > > > > but in GPU address space. Again, I fail to see what the problem is here. > > > > > > What am I missing? > > > > > > > > > > > > > > > > > > The main question is why can't you use the existing interfaces directly? > > > > > We're not working on the CPU address space or BOs. We're working > > > > > strictly on the GPU address space as would be seen by an EU thread if it > > > > > accessed address X. > > > > > > > > > > > Additional to the peek/poke interface of ptrace Linux has the pidfd_getfd > > > > > > system call, see herehttps://man7.org/linux/man-pages/man2/pidfd_getfd.2.html. > > > > > > > > > > > > The pidfd_getfd() allows to dup() the render node file descriptor into your gdb > > > > > > process. That in turn gives you all the access you need from gdb, including > > > > > > mapping BOs and command submission on behalf of the application. > > > > > We're not operating on the CPU address space nor are we operating on BOs > > > > > (there is no concept of BO in the EU debug interface). Each VMA in the VM > > > > > could come from anywhere, only the start address and size matter. And > > > > > neither do we need to interfere with the command submission of the > > > > > process under debug. > > > > > > > > > > > As far as I can see that allows for the same functionality as the eudebug > > > > > > interface, just without any driver specific code messing with ptrace > > > > > > permissions and peek/poke interfaces. > > > > > > > > > > > > So the question is still why do you need the whole eudebug interface in the > > > > > > first place? I might be missing something, but that seems to be superfluous > > > > > > from a high level view. > > > > > Recapping from above. It is to allow the debugging of EU threads per DRM > > > > > client, completely independent of the CPU process. If ptrace_may_acces > > > > > is the sore point, we could consider other permission checks, too. There > > > > > is no other connection to ptrace in this architecture as single > > > > > permission check to know if PID is fair game to access by debugger > > > > > process. > > > > > > > > > > Why no parasitic thread or ptrace: Going forward, binding the EU debugging to > > > > > the DRM client would also pave way for being able to extend core kernel generated > > > > > core dump with each DRM client's EU thread/memory dump. We have similar > > > > > feature called "Offline core dump" enabled in the downstream public > > > > > trees for i915, where we currently attach the EU thread dump to i915 error state > > > > > and then later combine i915 error state with CPU core dump file with a > > > > > tool. > > > > > > > > > > This is relatively little amount of extra code, as this baseline series > > > > > already introduces GDB the ability to perform the necessary actions. > > > > > It's just the matter of kernel driver calling: "stop all threads", then > > > > > copying the memory map and memory contents for GPU threads, just like is > > > > > done for CPU threads. > > > > > > > > > > With parasitic thread injection, not sure if there is such way forward, > > > > > as it would seem to require to inject quite abit more logic to core kernel? > > > > > > > > > > > It's true that the AMD KFD part has still similar functionality, but that is > > > > > > because of the broken KFD design of tying driver state to the CPU process > > > > > > (which makes it inaccessible for gdb even with imported render node fd). > > > > > > > > > > > > Both Sima and I (and partially Dave as well) have pushed back on the KFD > > > > > > approach. And the long term plan is to get rid of such device driver specific > > > > > > interface which re-implement existing functionality just differently. > > > > > Recapping, this series is not adding it back. The debugger connection > > > > > is a separate FD from the DRM one, with separate IOCTL set. We don't allow > > > > > the DRM FD any new operations based on ptrace is attached or not. We > > > > > don't ever do that check even. > > > > > > > > > > We only restrict the opening of the debugger connection to given PID with > > > > > ptrace_may_access check for now. That can be changed to something else, > > > > > if necessary. > > > > Yeah I think unnecessarily tying gpu processes to cpu processes is a bad > > > > thing, least because even today all the svm discussions we have still hit > > > > clear use-cases, where a 1:1 match is not wanted (like multiple gpu svm > > > > sections with offsets). Not even speaking of all the gpu usecases where > > > > the gpu vm space is still entirely independent of the cpu side. > > > > > > > > So that's why I think this entirely separate approach looks like the right > > > > one, with ptrace_may_access as the access control check to make sure we > > > > match ptrace on the cpu side. > > > > > > > > But there's very obviously a bikeshed to be had on what the actual uapi > > > > should look like, especially how gdb opens up a gpu debug access fd. But I > > > > also think that's not much on drm to decide, but whatever gdb wants. And > > > > then we aim for some consistency on that lookup/access control part > > > > (ideally, I might be missing some reasons why this is a bad idea) across > > > > drm drivers. > > > > > > > > > > So you need to have a really really good explanation why the eudebug interface > > > > > > is actually necessary. > > > > > TL;DR The main point is to decouple the debugging of the EU workloads from the > > > > > debugging of the CPU process. This avoids the interference with the CPU process with > > > > > parasitic thread injection. Further this also allows generating a core dump > > > > > without any GDB connected. There are also many other smaller pros/cons > > > > > which can be discussed but for the context of this patch, this is the > > > > > main one. > > > > > > > > > > So unlike parasitic thread injection, we don't unlock any special IOCTL for > > > > > the process under debug to be performed by the parasitic thread, but we > > > > > allow the minimal set of operations to be performed by GDB as if those were > > > > > done on the EUs themselves. > > > > > > > > > > One can think of it like the minimal subset of ptrace but for EU threads, > > > > > not the CPU threads. And thus, building on this it's possible to extend > > > > > the core kernel generated core dumps with DRM specific extension which > > > > > would contain the EU thread/memory dump. > > > > It might be good to document (in that debugging doc patch probably) why > > > > thread injection is not a great option, and why the tradeoffs for > > > > debugging are different than for for checkpoint/restore, where with CRIU > > > > we landed on doing most of this in userspace, and often requiring > > > > injection threads to make it all work. > > > > > > > > Cheers, Sima > > > > > > > > > Regards, Joonas > > > > > > > > > > > Regards, > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > [3]https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6 > > > > > > > > > > > > > > > > > > Regards, > > > > > > Christian. > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > Regards, > > > > > > Christian. > > > > > >