On Tue, 2024-11-12 at 15:41 +0200, Joonas Lahtinen wrote:
> (+ Thomas)
> 
> Quoting Christian König (2024-11-12 11:23:36)
> > Am 11.11.24 um 23:45 schrieb Matthew Brost:
> > > [SNIP]
> > > > > So I think the only way to allow interactive debugging is to
> > > > > avoid the dma_fences. Curious to hear if there are ideas for
> > > > > otherwise.
> > > > 
> > > > You need to guarantee somehow that the process is taken from
> > > > the hardware so that the preemption fence can signal.
> > > 
> > > Our preemption fences have this functionality.
> > > 
> > > A preemption fence issues a suspend execution command to the
> > > firmware. The firmware, in turn, attempts to preempt the
> > > workload. If it doesn't respond within a specified period, it
> > > resets the hardware queue, sends a message to KMD, bans the
> > > software queue, and signals the preemption fence.
> > > 
> > > We provide even more protection than that. If, for some reason,
> > > the firmware doesn't respond within a longer timeout period, the
> > > KMD performs a device reset, bans the offending software
> > > queue(s), and signals the preemption fences.
> > > 
> > > This flow remains the same whether a debugger is attached or,
> > > for example, a user submits a 10-minute non-preemptable
> > > workload. In either case, other processes are guaranteed to make
> > > forward progress.
> > 
> > Yeah that is pretty much the same argumentation I have heard
> > before and it turned out to not be working.
> > 
> > > The example above illustrates the memory oversubscription case,
> > > where two processes are using 51% of the memory.
> > 
> > That isn't even necessary. We have seen applications dying just
> > because the core memory management tried to join back small pages
> > into huge pages in an userptr.
> > 
> > That the core memory management jumps in and requests that the
> > pre-emption fence signals can happen all the time.
> 
> Ouch.
> Does there happen to be a known reproducer for this behavior or
> maybe a bug report?
> 
> > You can mitigate that a bit, Fedora for example disables joining
> > back small pages into huge pages by default, and we even had
> > people suggesting to use mprotect() so that userptr VMAs don't
> > fork() any more (which is of course completely illegal).
> > 
> > But my long term take away is that you can't block all causes of
> > sudden requests to let a pre-emption fence signal.
> 
> I think this problem equally applies to the LR-workloads like the
> EU debugging ones.
> 
> > > Another preemption scenario involves two processes sharing
> > > hardware resources. Our firmware follows the same flow here. If
> > > an LR workload is using a hardware resource and a DMA-fence
> > > workload is waiting, and if the LR workload doesn't preempt in a
> > > timely manner, the firmware issues a hardware reset, notifies
> > > KMD, and bans the LR software queue. The DMA-fence workload can
> > > then make forward progress.
> > > 
> > > With the above in mind, this is why I say that if a user tries
> > > to run a game and a non-preemptable LR workload, either
> > > oversubscribing memory or sharing hardware resources, it is
> > > unlikely to work well. However, I don't think this is a common
> > > use case. I would expect that when a debugger is open, it is
> > > typically by a power user who knows how to disable other GPU
> > > tasks (e.g., by enabling software rendering or using a machine
> > > without any display).
> > > 
> > > Given this, please reconsider your position.
> > 
> > The key point here is that this isn't stable, you can do that as a
> > tech demo but it can always be that debugging an application just
> > randomly dies. And believe me AMD has tried this to a rather
> > extreme extent as well.
> 
> It's not really only limited to the debuggable applications at all,
> the normal LR workloads are equally impacted as far as I understand.
> Just harder to catch the issue with LR-workloads if the pre-emption
> fence signaling is sporadic.
> 
> > What could potentially work is to taint the kernel and make sure
> > that this function is only available to users who absolutely know
> > what they are doing.
> > 
> > But I would say we can only allow that if all other options have
> > been exercised and doing it like this is really the only option
> > left.
> 
> It sounds like servicing the memory pre-empt fence by stealing the
> pages from underneath the workload would be the way to resolve this
> issue.
> 
> This has been extensively discussed already, but was expected to
> really only be needed for low-on-memory scenarios. However it now
> seems like the need is much earlier due to the random userptr page
> joining by core mm.

Just to clarify here: In Long-Running mode with recoverable
pagefaults enabled we don't have any preempt-fences, but rather just
zap the PTEs pointing to the affected memory and flush the TLB. So
from a memory resource PoV a breakpoint should be safe, and no mmu
notifier nor shrinker will be blocked. Nor will there be any jobs
with published dma-fences depending on the job blocked either
temporarily by a pagefault or long-term by a debugger breakpoint.

/Thomas

> If that is done and the memory pre-empt fence is serviced even for
> debuggable contexts, do you have further concerns with the
> presented approach from dma-buf and drm/sched perspective?
> 
> Regards, Joonas
> 
> > Regards,
> > Christian.
> 
> > > > > This means that a breakpoint or core dump doesn't halt GPU
> > > > > threads, but rather suspends them. E.g. all running wave
> > > > > data is collected into a state bag which can be restored
> > > > > later on.
> > > > 
> > > > I was under the impression that those long running compute
> > > > threads do exactly that, but when the hardware can't switch
> > > > out the GPU thread/process while in a break then that isn't
> > > > the case.
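As an aside for readers following along: the escalating timeout chain Matthew describes above (firmware preempt attempt, then hardware queue reset, then KMD device reset) can be pictured with a small userspace model. All names and the timeout value below are made up for illustration; this is not actual driver or firmware code. The point it demonstrates is that every branch ends with the preemption fence signaled.

```c
/* Hypothetical model of the escalation chain behind a preemption
 * fence; names and values are illustrative only. */
enum preempt_outcome {
    PREEMPTED,      /* firmware suspended the workload in time */
    QUEUE_RESET,    /* firmware reset the HW queue, banned the SW queue */
    DEVICE_RESET,   /* longer timeout hit: KMD device reset + ban */
};

#define PREEMPT_TIMEOUT_MS 100  /* illustrative value */

/* Given how long the workload took to yield (-1 if it never did) and
 * whether the firmware itself is still responding, pick the
 * escalation level. In every branch the preemption fence ends up
 * signaled, which is the guarantee discussed in the thread. */
static enum preempt_outcome resolve_preempt(int yield_after_ms,
                                            int fw_alive)
{
    if (!fw_alive)
        return DEVICE_RESET;  /* firmware unresponsive: KMD resets */
    if (yield_after_ms >= 0 && yield_after_ms <= PREEMPT_TIMEOUT_MS)
        return PREEMPTED;     /* normal, cooperative case */
    return QUEUE_RESET;       /* firmware resets the HW queue */
}
```

The same function covers both the debugger-attached case and the 10-minute non-preemptable workload case: only the elapsed time differs, not the flow.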
> > > > As long as you don't find a way to avoid that this patch set
> > > > is a pretty clear NAK from my side as DMA-buf and TTM
> > > > maintainer.
> > > 
> > > I believe this is addressed above.
> > > 
> > > Matt
> > > 
> > > > What might work is to keep the submission on the hardware in
> > > > the break state but forbid any memory access. This way you can
> > > > signal your preemption fence even when the hardware isn't made
> > > > available.
> > > > 
> > > > Before you continue Xe sets up a new pre-emption fence and
> > > > makes sure that all page tables etc... are up to date.
> > > > 
> > > > Could be tricky to get this right if completion fence based
> > > > submissions are mixed in as well, but that gives you at least
> > > > a direction you could potentially go.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > Regards, Joonas
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > 
> > > > > Some wash-up thoughts from me below, but consider them
> > > > > fairly irrelevant since I think the main driver for these
> > > > > big questions here should be gdb/userspace.
> > > > > 
> > > > > > Quoting Christian König (2024-11-07 11:44:33)
> > > > > > > Am 06.11.24 um 18:00 schrieb Matthew Brost:
> > > > > > > > [SNIP]
> > > > > > > > This is not a generic interface that anyone can freely
> > > > > > > > access. The same permissions used by ptrace are
> > > > > > > > checked when opening such an interface. See [1] [2].
> > > > > > > > 
> > > > > > > > [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
> > > > > > > > [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
> > > > > > > 
> > > > > > > Thanks a lot for those pointers, that is exactly what I
> > > > > > > was looking for.
> > > > > > > 
> > > > > > > And yeah, it is what I feared. You are re-implementing
> > > > > > > existing functionality, but see below.
> > > > > > 
> > > > > > Could you elaborate on what this "existing functionality"
> > > > > > exactly is? I do not think this functionality exists at
> > > > > > this time.
> > > > > > The EU debugging architecture for Xe specifically avoids
> > > > > > the need for GDB to attach with ptrace to the CPU process
> > > > > > or interfere with the CPU process for the debugging via
> > > > > > parasitic threads or so.
> > > > > > 
> > > > > > Debugger connection is opened to the DRM driver for a
> > > > > > given PID (which uses the ptrace may access check for now)
> > > > > > after which all DRM clients of that PID are exposed to the
> > > > > > debugger process.
> > > > > > 
> > > > > > What we want to expose via that debugger connection is the
> > > > > > ability for GDB to read/write the different GPU VM address
> > > > > > spaces (ppGTT for Intel GPUs) just like the EU threads
> > > > > > would see them. Note that the layout of the ppGTT is
> > > > > > completely up to the userspace driver to set up and is
> > > > > > mostly only partially equal to the CPU address space.
> > > > > > 
> > > > > > Specifically as part of reading/writing the ppGTT for
> > > > > > debugging purposes, there are deep flushes needed: for
> > > > > > example flushing the instruction cache when adding/removing
> > > > > > breakpoints.
> > > > > > 
> > > > > > Maybe that will explain the background. I elaborate on
> > > > > > this some more at the end.
> > > > > > 
> > > > > > > > kmap/vmap are used everywhere in the DRM subsystem to
> > > > > > > > access BOs, so I'm failing to see the problem with
> > > > > > > > adding a simple helper based on existing code.
> > > > > > > 
> > > > > > > What's possible and often done is to do kmap/vmap if you
> > > > > > > need to implement a CPU copy for scanout for example or
> > > > > > > for copying/validating command buffers. But that usually
> > > > > > > requires accessing the whole BO and has separate
> > > > > > > security checks.
> > > > > > > 
> > > > > > > When you want to access only a few bytes of a BO that
> > > > > > > sounds massively like a peek/poke like interface and we
> > > > > > > have already rejected that more than once. There even
> > > > > > > used to be standardized GEM IOCTLs for that which have
> > > > > > > been removed by now.
> > > > > > 
> > > > > > Referring to the explanation at top: These IOCTLs are not
> > > > > > for the debugging target process to issue.
> > > > > > The peek/poke interface is specifically for GDB only, to
> > > > > > facilitate the emulation of memory reads/writes on the GPU
> > > > > > address space as they were done by the EUs themselves. And
> > > > > > to recap: for modifying instructions for example
> > > > > > (add/remove breakpoint), an extra level of cache flushing
> > > > > > is needed which is not available to regular userspace.
> > > > > > 
> > > > > > I specifically discussed the difference with Sima before
> > > > > > moving forward with this design originally. If something
> > > > > > has changed since then, I'm of course happy to rediscuss.
> > > > > > 
> > > > > > However, if this code can't be added, not sure how we
> > > > > > would ever be able to implement core dumps for GPU
> > > > > > threads/memory?
> > > > > > 
> > > > > > > If you need to access BOs which are placed in not CPU
> > > > > > > accessible memory then implement the access callback for
> > > > > > > ptrace, see amdgpu_ttm_access_memory for an example how
> > > > > > > to do this.
> > > > > > 
> > > > > > As also mentioned above, we don't work via ptrace at all
> > > > > > when it comes to debugging the EUs. The only thing used
> > > > > > for now is the ptrace_may_access to implement similar
> > > > > > access restrictions as ptrace has. This can be changed to
> > > > > > something else if needed.
> > > > > > 
> > > > > > > > Ptrace access via vm_operations_struct.access →
> > > > > > > > ttm_bo_vm_access.
> > > > > > > > 
> > > > > > > > This series renames ttm_bo_vm_access to ttm_bo_access,
> > > > > > > > with no code changes.
> > > > > > > > 
> > > > > > > > The above function accesses a BO via kmap if it is in
> > > > > > > > SYSTEM / TT, which is existing code.
> > > > > > > > 
> > > > > > > > This function is only exposed to user space via ptrace
> > > > > > > > permissions.
> > > > > > 
> > > > > > Maybe this sentence is what caused the confusion.
> > > > > > 
> > > > > > Userspace is never exposed with the peek/poke interface,
> > > > > > only the debugger connection which is its own FD.
> > > > > > 
> > > > > > > > In this series, we implement a function [3] similar to
> > > > > > > > amdgpu_ttm_access_memory for the TTM vfunc
> > > > > > > > access_memory. What is missing is non-visible CPU
> > > > > > > > memory access, similar to
> > > > > > > > amdgpu_ttm_access_memory_sdma.
> > > > > > > > This will be addressed in a follow-up and was omitted
> > > > > > > > in this series given its complexity.
> > > > > > > > 
> > > > > > > > So, this looks more or less identical to AMD's ptrace
> > > > > > > > implementation, but in GPU address space. Again, I
> > > > > > > > fail to see what the problem is here. What am I
> > > > > > > > missing?
> > > > > > > 
> > > > > > > The main question is why can't you use the existing
> > > > > > > interfaces directly?
> > > > > > 
> > > > > > We're not working on the CPU address space or BOs. We're
> > > > > > working strictly on the GPU address space as would be seen
> > > > > > by an EU thread if it accessed address X.
> > > > > > 
> > > > > > > Additional to the peek/poke interface of ptrace Linux
> > > > > > > has the pidfd_getfd system call, see here:
> > > > > > > https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html
> > > > > > > 
> > > > > > > The pidfd_getfd() allows to dup() the render node file
> > > > > > > descriptor into your gdb process. That in turn gives you
> > > > > > > all the access you need from gdb, including mapping BOs
> > > > > > > and command submission on behalf of the application.
> > > > > > 
> > > > > > We're not operating on the CPU address space nor are we
> > > > > > operating on BOs (there is no concept of a BO in the EU
> > > > > > debug interface). Each VMA in the VM could come from
> > > > > > anywhere, only the start address and size matter. And
> > > > > > neither do we need to interfere with the command
> > > > > > submission of the process under debug.
> > > > > > 
> > > > > > > As far as I can see that allows for the same
> > > > > > > functionality as the eudebug interface, just without any
> > > > > > > driver specific code messing with ptrace permissions and
> > > > > > > peek/poke interfaces.
> > > > > > > 
> > > > > > > So the question is still why do you need the whole
> > > > > > > eudebug interface in the first place? I might be missing
> > > > > > > something, but that seems to be superfluous from a high
> > > > > > > level view.
> > > > > > 
> > > > > > Recapping from above. It is to allow the debugging of EU
> > > > > > threads per DRM client, completely independent of the CPU
> > > > > > process. If ptrace_may_access is the sore point, we could
> > > > > > consider other permission checks, too.
> > > > > > There is no other connection to ptrace in this
> > > > > > architecture than the single permission check to know if a
> > > > > > PID is fair game to access by the debugger process.
> > > > > > 
> > > > > > Why no parasitic thread or ptrace: Going forward, binding
> > > > > > the EU debugging to the DRM client would also pave the way
> > > > > > for being able to extend the core kernel generated core
> > > > > > dump with each DRM client's EU thread/memory dump. We have
> > > > > > a similar feature called "Offline core dump" enabled in
> > > > > > the downstream public trees for i915, where we currently
> > > > > > attach the EU thread dump to the i915 error state and then
> > > > > > later combine the i915 error state with the CPU core dump
> > > > > > file with a tool.
> > > > > > 
> > > > > > This is a relatively small amount of extra code, as this
> > > > > > baseline series already gives GDB the ability to perform
> > > > > > the necessary actions. It's just a matter of the kernel
> > > > > > driver calling "stop all threads", then copying the memory
> > > > > > map and memory contents for GPU threads, just like is done
> > > > > > for CPU threads.
> > > > > > 
> > > > > > With parasitic thread injection, not sure if there is such
> > > > > > a way forward, as it would seem to require injecting quite
> > > > > > a bit more logic into the core kernel?
> > > > > > 
> > > > > > > It's true that the AMD KFD part still has similar
> > > > > > > functionality, but that is because of the broken KFD
> > > > > > > design of tying driver state to the CPU process (which
> > > > > > > makes it inaccessible for gdb even with an imported
> > > > > > > render node fd).
> > > > > > > 
> > > > > > > Both Sima and I (and partially Dave as well) have pushed
> > > > > > > back on the KFD approach. And the long term plan is to
> > > > > > > get rid of such device driver specific interfaces which
> > > > > > > re-implement existing functionality just differently.
> > > > > > 
> > > > > > Recapping, this series is not adding it back. The debugger
> > > > > > connection is a separate FD from the DRM one, with a
> > > > > > separate IOCTL set. We don't allow the DRM FD any new
> > > > > > operations based on whether ptrace is attached or not. We
> > > > > > don't ever do that check even.
> > > > > > We only restrict the opening of the debugger connection
> > > > > > to a given PID with the ptrace_may_access check for now.
> > > > > > That can be changed to something else, if necessary.
> > > > > 
> > > > > Yeah I think unnecessarily tying gpu processes to cpu
> > > > > processes is a bad thing, not least because even today all
> > > > > the svm discussions we have still hit clear use-cases where
> > > > > a 1:1 match is not wanted (like multiple gpu svm sections
> > > > > with offsets). Not even speaking of all the gpu usecases
> > > > > where the gpu vm space is still entirely independent of the
> > > > > cpu side.
> > > > > 
> > > > > So that's why I think this entirely separate approach looks
> > > > > like the right one, with ptrace_may_access as the access
> > > > > control check to make sure we match ptrace on the cpu side.
> > > > > 
> > > > > But there's very obviously a bikeshed to be had on what the
> > > > > actual uapi should look like, especially how gdb opens up a
> > > > > gpu debug access fd. But I also think that's not much on drm
> > > > > to decide, but whatever gdb wants. And then we aim for some
> > > > > consistency on that lookup/access control part (ideally, I
> > > > > might be missing some reasons why this is a bad idea) across
> > > > > drm drivers.
> > > > > 
> > > > > > > So you need to have a really really good explanation why
> > > > > > > the eudebug interface is actually necessary.
> > > > > > 
> > > > > > TL;DR The main point is to decouple the debugging of the
> > > > > > EU workloads from the debugging of the CPU process. This
> > > > > > avoids the interference with the CPU process via parasitic
> > > > > > thread injection. Further this also allows generating a
> > > > > > core dump without any GDB connected. There are also many
> > > > > > other smaller pros/cons which can be discussed, but for
> > > > > > the context of this patch, this is the main one.
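To illustrate the access-control point made above: opening the debugger connection is gated the same way a ptrace attach would be. A much-simplified userspace model of such a gate is sketched below. This is only the idea, with made-up names; the series actually calls the kernel's ptrace_may_access(), whose real rules also involve dumpability, LSM hooks and the Yama sysctl, none of which is modeled here.

```c
#include <stdbool.h>

/* Grossly simplified model of a ptrace-style "may access" check used
 * to gate opening a debugger connection for a target PID. The real
 * kernel check (ptrace_may_access) also consults LSMs, Yama and
 * dumpability; this only models the basic same-user/privileged rule. */
struct cred_model {
    unsigned int uid;
    bool cap_sys_ptrace;   /* privileged debugger */
};

static bool may_open_debug_fd(const struct cred_model *debugger,
                              const struct cred_model *target)
{
    if (debugger->cap_sys_ptrace)
        return true;                      /* privileged: allowed */
    return debugger->uid == target->uid;  /* same-user rule */
}
```

The design point in the thread is exactly that this check is the *only* coupling to ptrace: everything past the open is a separate FD with its own IOCTL set.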
> > > > > > So unlike parasitic thread injection, we don't unlock any
> > > > > > special IOCTLs for the process under debug to be performed
> > > > > > by the parasitic thread, but we allow the minimal set of
> > > > > > operations to be performed by GDB as if those were done on
> > > > > > the EUs themselves.
> > > > > > 
> > > > > > One can think of it like the minimal subset of ptrace but
> > > > > > for EU threads, not the CPU threads. And thus, building on
> > > > > > this, it's possible to extend the core kernel generated
> > > > > > core dumps with a DRM specific extension which would
> > > > > > contain the EU thread/memory dump.
> > > > > 
> > > > > It might be good to document (in that debugging doc patch
> > > > > probably) why thread injection is not a great option, and
> > > > > why the tradeoffs for debugging are different than for
> > > > > checkpoint/restore, where with CRIU we landed on doing most
> > > > > of this in userspace, and often requiring injection threads
> > > > > to make it all work.
> > > > > 
> > > > > Cheers, Sima
> > > > > 
> > > > > > Regards, Joonas
> > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > 
> > > > > > > > Matt
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
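For readers following the ttm_bo_access part of the thread above: the core of such an access helper is a loop that walks the BO page by page, mapping each page and copying only the bytes that fall inside the requested window, which is what lets a debugger peek arbitrary byte ranges rather than whole BOs. A standalone sketch of that pattern, with plain memory standing in for kmap'ed pages (all names are made up; this is not the TTM code):

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096u

/* Toy stand-in for a BO backed by an array of pages. In TTM the
 * pages would be kmap'ed one at a time; here they are plain memory. */
struct toy_bo {
    unsigned char *pages[4];
    size_t npages;
};

/* Copy 'len' bytes at byte 'offset' of the BO into 'buf', walking
 * page by page the way a ttm access_memory-style helper would.
 * Returns the number of bytes copied, clamped to the BO size. */
static size_t toy_bo_access(const struct toy_bo *bo, size_t offset,
                            unsigned char *buf, size_t len)
{
    size_t done = 0;
    size_t size = bo->npages * PAGE_SZ;

    if (offset >= size)
        return 0;
    if (len > size - offset)
        len = size - offset;

    while (done < len) {
        size_t page = (offset + done) / PAGE_SZ;
        size_t off_in_page = (offset + done) % PAGE_SZ;
        size_t chunk = PAGE_SZ - off_in_page;

        if (chunk > len - done)
            chunk = len - done;
        /* In the kernel, kmap/kunmap would bracket this copy. */
        memcpy(buf + done, bo->pages[page] + off_in_page, chunk);
        done += chunk;
    }
    return done;
}
```

The EU debug case discussed in the thread adds two things on top of this shape: the lookup goes through the GPU VM (ppGTT) rather than a BO handle, and writes need the deep instruction-cache flush mentioned above.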