Quoting Christian König (2024-11-11 13:34:12) > Am 11.11.24 um 11:10 schrieb Simona Vetter: > > On Mon, Nov 11, 2024 at 10:00:17AM +0200, Joonas Lahtinen wrote: > >> Back from some time off and will try to answer below. > >> > >> Adding Dave and Sima as this topic has been previously discussed to some > >> extent and will be good to reach common understanding about what the > >> series is trying to do and what is the difference to the AMD debugging > >> model. > > I chatted about this thread a bit on irc with folks, and I think an > > orthogonal issue is the question, what should be in ttm-utils? I've asked > > Matt to type up a DOC patch once we have some consensus, since imo the > > somewhat lackluster documentation situation for ttm is also somewhat a > > cause for these big threads on various different topics. Aside from the > > fact that gpu memory management is just hard. > > > > On the uapi/design aspect, I think this would serve well with a patch to > > drm-uapi.rst that adds a debugging section? At least once we have some > > rough consensus across drivers, and more importantly userspace in the form > > of gdb upstream (at least I'm not aware of any other upstream debugger > > patches, I think amd's rocm stuff is also gdb-only). > > Yeah that seems to be a really good idea. Similar design ideas came up > AMD internally as well but where dropped after pointing people to > pidfd_getfd(). > > But the bigger problem seems to be that the design doesn't seems to take > the dma_fence requirements into account. Where would you deduce that? We specifically limit the debugging to Long Running contexts which don't depend on dma_fences. > In other words attaching gdb to a pid seems to stop the GPU thread of > this pid without waiting for the XE preemption nor end of operation fence. > > I mean if the GPU threads are preempted that could work, but yeah not > like this :) For us, hitting a breakpoint inside the workload would always violate any dma_fence timeout for the submitted workload, as the HW context can't be switched out while in the breakpoint. For any dma_fence workload the guarantee is that that it completes within reasonable time after submission (guaranteed by the submitter). I don't see how you could really allow interactive debugging of a breakpoint under those restrictions anyway even if pre-emption was supported as the workload would not finish in <10 seconds? For i915 we did have the "pre-emptable but indefinitely long dma_fence workloads" concept at one point and that was rejected after the lengthy discussion. So I think only way to allow interactive debugging is to avoid the dma_fences. Curious to hear if there are ideas for otherwise. Regards, Joonas > > Regards, > Christian. > > > > > Some wash-up thoughts from me below, but consider them fairly irrelevant > > since I think the main driver for these big questions here should be > > gdb/userspace. > > > >> Quoting Christian König (2024-11-07 11:44:33) > >>> Am 06.11.24 um 18:00 schrieb Matthew Brost: > >>> > >>> [SNIP] > >>> > >>> This is not a generic interface that anyone can freely access. The same > >>> permissions used by ptrace are checked when opening such an interface. > >>> See [1] [2]. > >>> > >>> [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2 > >>> [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2 > >>> > >>> > >>> Thanks a lot for those pointers, that is exactly what I was looking for. > >>> > >>> And yeah, it is what I feared. You are re-implementing existing functionality, > >>> but see below. > >> Could you elaborate on what this "existing functionality" exactly is? > >> I do not think this functionality exists at this time. > >> > >> The EU debugging architecture for Xe specifically avoids the need for GDB > >> to attach with ptrace to the CPU process or interfere with the CPU process for > >> the debugging via parasitic threads or so. > >> > >> Debugger connection is opened to the DRM driver for given PID (which uses the > >> ptrace may access check for now) after which the all DRM client of that > >> PID are exposed to the debugger process. > >> > >> What we want to expose via that debugger connection is the ability for GDB to > >> read/write the different GPU VM address spaces (ppGTT for Intel GPUs) just like > >> the EU threads would see them. Note that the layout of the ppGTT is > >> completely up to the userspace driver to setup and is mostly only partially > >> equal to the CPU address space. > >> > >> Specifically as part of reading/writing the ppGTT for debugging purposes, > >> there are deep flushes needed: for example flushing instruction cache > >> when adding/removing breakpoints. > >> > >> Maybe that will explain the background. I elaborate on this at the end some more. > >> > >>> kmap/vmap are used everywhere in the DRM subsystem to access BOs, so I’m > >>> failing to see the problem with adding a simple helper based on existing > >>> code. > >>> > >>> What#s possible and often done is to do kmap/vmap if you need to implement a > >>> CPU copy for scanout for example or for copying/validating command buffers. > >>> But that usually requires accessing the whole BO and has separate security > >>> checks. > >>> > >>> When you want to access only a few bytes of a BO that sounds massively like > >>> a peek/poke like interface and we have already rejected that more than once. > >>> There even used to be standardized GEM IOCTLs for that which have been > >>> removed by now. > >> Referring to the explanation at top: These IOCTL are not for the debugging target > >> process to issue. The peek/poke interface is specifically for GDB only > >> to facilitate the emulation of memory reads/writes on the GPU address > >> space as they were done by EUs themselves. And to recap: for modifying > >> instructions for example (add/remove breakpoint), extra level of cache flushing is > >> needed which is not available to regular userspace. > >> > >> I specifically discussed with Sima on the difference before moving forward with this > >> design originally. If something has changed since then, I'm of course happy to rediscuss. > >> > >> However, if this code can't be added, not sure how we would ever be able > >> to implement core dumps for GPU threads/memory? > >> > >>> If you need to access BOs which are placed in not CPU accessible memory then > >>> implement the access callback for ptrace, see amdgpu_ttm_access_memory for > >>> an example how to do this. > >> As also mentioned above, we don't work via ptrace at all when it comes > >> to debugging the EUs. The only thing used for now is the ptrace_may_access to > >> implement similar access restrictions as ptrace has. This can be changed > >> to something else if needed. > >> > >>> Ptrace access via vm_operations_struct.access → ttm_bo_vm_access. > >>> > >>> This series renames ttm_bo_vm_access to ttm_bo_access, with no code changes. > >>> > >>> The above function accesses a BO via kmap if it is in SYSTEM / TT, > >>> which is existing code. > >>> > >>> This function is only exposed to user space via ptrace permissions. > >> Maybe this sentence is what caused the confusion. > >> > >> Userspace is never exposed with peek/poke interface, only the debugger > >> connection which is its own FD. > >> > >>> In this series, we implement a function [3] similar to > >>> amdgpu_ttm_access_memory for the TTM vfunc access_memory. What is > >>> missing is non-visible CPU memory access, similar to > >>> amdgpu_ttm_access_memory_sdma. This will be addressed in a follow-up and > >>> was omitted in this series given its complexity. > >>> > >>> So, this looks more or less identical to AMD's ptrace implementation, > >>> but in GPU address space. Again, I fail to see what the problem is here. > >>> What am I missing? > >>> > >>> > >>> The main question is why can't you use the existing interfaces directly? > >> We're not working on the CPU address space or BOs. We're working > >> strictly on the GPU address space as would be seen by an EU thread if it > >> accessed address X. > >> > >>> Additional to the peek/poke interface of ptrace Linux has the pidfd_getfd > >>> system call, see here https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html. > >>> > >>> The pidfd_getfd() allows to dup() the render node file descriptor into your gdb > >>> process. That in turn gives you all the access you need from gdb, including > >>> mapping BOs and command submission on behalf of the application. > >> We're not operating on the CPU address space nor are we operating on BOs > >> (there is no concept of BO in the EU debug interface). Each VMA in the VM > >> could come from anywhere, only the start address and size matter. And > >> neither do we need to interfere with the command submission of the > >> process under debug. > >> > >>> As far as I can see that allows for the same functionality as the eudebug > >>> interface, just without any driver specific code messing with ptrace > >>> permissions and peek/poke interfaces. > >>> > >>> So the question is still why do you need the whole eudebug interface in the > >>> first place? I might be missing something, but that seems to be superfluous > >>> from a high level view. > >> Recapping from above. It is to allow the debugging of EU threads per DRM > >> client, completely independent of the CPU process. If ptrace_may_acces > >> is the sore point, we could consider other permission checks, too. There > >> is no other connection to ptrace in this architecture as single > >> permission check to know if PID is fair game to access by debugger > >> process. > >> > >> Why no parasitic thread or ptrace: Going forward, binding the EU debugging to > >> the DRM client would also pave way for being able to extend core kernel generated > >> core dump with each DRM client's EU thread/memory dump. We have similar > >> feature called "Offline core dump" enabled in the downstream public > >> trees for i915, where we currently attach the EU thread dump to i915 error state > >> and then later combine i915 error state with CPU core dump file with a > >> tool. > >> > >> This is relatively little amount of extra code, as this baseline series > >> already introduces GDB the ability to perform the necessary actions. > >> It's just the matter of kernel driver calling: "stop all threads", then > >> copying the memory map and memory contents for GPU threads, just like is > >> done for CPU threads. > >> > >> With parasitic thread injection, not sure if there is such way forward, > >> as it would seem to require to inject quite abit more logic to core kernel? > >> > >>> It's true that the AMD KFD part has still similar functionality, but that is > >>> because of the broken KFD design of tying driver state to the CPU process > >>> (which makes it inaccessible for gdb even with imported render node fd). > >>> > >>> Both Sima and I (and partially Dave as well) have pushed back on the KFD > >>> approach. And the long term plan is to get rid of such device driver specific > >>> interface which re-implement existing functionality just differently. > >> Recapping, this series is not adding it back. The debugger connection > >> is a separate FD from the DRM one, with separate IOCTL set. We don't allow > >> the DRM FD any new operations based on ptrace is attached or not. We > >> don't ever do that check even. > >> > >> We only restrict the opening of the debugger connection to given PID with > >> ptrace_may_access check for now. That can be changed to something else, > >> if necessary. > > Yeah I think unnecessarily tying gpu processes to cpu processes is a bad > > thing, least because even today all the svm discussions we have still hit > > clear use-cases, where a 1:1 match is not wanted (like multiple gpu svm > > sections with offsets). Not even speaking of all the gpu usecases where > > the gpu vm space is still entirely independent of the cpu side. > > > > So that's why I think this entirely separate approach looks like the right > > one, with ptrace_may_access as the access control check to make sure we > > match ptrace on the cpu side. > > > > But there's very obviously a bikeshed to be had on what the actual uapi > > should look like, especially how gdb opens up a gpu debug access fd. But I > > also think that's not much on drm to decide, but whatever gdb wants. And > > then we aim for some consistency on that lookup/access control part > > (ideally, I might be missing some reasons why this is a bad idea) across > > drm drivers. > > > >>> So you need to have a really really good explanation why the eudebug interface > >>> is actually necessary. > >> TL;DR The main point is to decouple the debugging of the EU workloads from the > >> debugging of the CPU process. This avoids the interference with the CPU process with > >> parasitic thread injection. Further this also allows generating a core dump > >> without any GDB connected. There are also many other smaller pros/cons > >> which can be discussed but for the context of this patch, this is the > >> main one. > >> > >> So unlike parasitic thread injection, we don't unlock any special IOCTL for > >> the process under debug to be performed by the parasitic thread, but we > >> allow the minimal set of operations to be performed by GDB as if those were > >> done on the EUs themselves. > >> > >> One can think of it like the minimal subset of ptrace but for EU threads, > >> not the CPU threads. And thus, building on this it's possible to extend > >> the core kernel generated core dumps with DRM specific extension which > >> would contain the EU thread/memory dump. > > It might be good to document (in that debugging doc patch probably) why > > thread injection is not a great option, and why the tradeoffs for > > debugging are different than for for checkpoint/restore, where with CRIU > > we landed on doing most of this in userspace, and often requiring > > injection threads to make it all work. > > > > Cheers, Sima > > > >> Regards, Joonas > >> > >>> Regards, > >>> Christian. > >>> > >>> > >>> > >>> Matt > >>> > >>> [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6 > >>> > >>> > >>> Regards, > >>> Christian. > >>> > >>> > >>> Matt > >>> > >>> > >>> Regards, > >>> Christian. > >>> >