On Wed, 2024-11-13 at 09:37 +0100, Christian König wrote:

On 12.11.24 at 17:33, Thomas Hellström wrote:

[SNIP]

This has been extensively discussed already, but was expected to really only be needed for low-on-memory scenarios. However, it now seems the need arises much earlier due to the random userptr page joining by core mm.

Just to clarify here: in long-running mode with recoverable pagefaults enabled we don't have any preempt-fences, but rather just zap the PTEs pointing to the affected memory and flush the TLB. So from a memory-resource POV a breakpoint should be safe, and no mmu notifier nor shrinker will be blocked.

That sounds like an HMM-based approach, which would clearly work. But where is that? I don't see any HMM-based approach anywhere in the xe driver.

This is a mode that uses recoverable pagefaults to fault in either full userptrs or full BOs, and is used with DRM_XE_VM_CREATE_FLAG_FAULT_MODE (not SVM!). userptrs in xe are BO-less and use the VM's resv, but otherwise use hmm similarly to amdgpu: xe_hmm.c.

Yeah, I have seen that one.

Fault servicing: xe_gt_pagefault.c. PTE zapping on eviction and notifier: xe_vm_invalidate_vma(), xe_vm.c.

Ah, that was the stuff I was missing. So the implementation in xe_preempt_fence.c is just for graphics submissions? That would make the whole thing much easier to handle.

Actually it's not; it's intended for long-running mode, but as a consequence the debugger would be allowed only in fault mode.
Makes sense, yes.
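The "zap PTEs and retry on fault" flow described above follows the usual hmm/mmu-notifier sequence-count pattern. The following is a toy userspace model of that pattern, not the actual xe or hmm API — all names here are illustrative, and the sequence counter stands in for mmu_interval_read_begin()/mmu_interval_read_retry():

```c
/* Toy model of the hmm-style "zap and retry" pattern: the notifier
 * invalidates a range (bumping a sequence counter), and the fault
 * handler revalidates by re-checking the sequence after rebuilding
 * the mapping.  Illustrative only. */
#include <stdbool.h>

struct toy_range {
    unsigned long seq;      /* bumped by the notifier */
    bool mapped;            /* "PTEs present" */
};

/* Notifier callback: stands in for zapping PTEs + flushing the TLB. */
static void toy_notifier_invalidate(struct toy_range *r)
{
    r->seq++;
    r->mapped = false;
}

static unsigned long toy_read_begin(const struct toy_range *r)
{
    return r->seq;
}

static bool toy_read_retry(const struct toy_range *r, unsigned long seq)
{
    return r->seq != seq;   /* an invalidation raced with us: retry */
}

/* Fault servicing loop: keep faulting pages in until no invalidation
 * raced with the rebuild.  Returns the number of attempts. */
static int toy_service_fault(struct toy_range *r)
{
    int tries = 0;
    unsigned long seq;

    do {
        seq = toy_read_begin(r);
        r->mapped = true;   /* stands in for hmm_range_fault + bind */
        tries++;
    } while (toy_read_retry(r, seq));
    return tries;
}
```

The key property, as discussed in the thread, is that the notifier side never blocks on anything the GPU job is doing — it only bumps the counter and zaps.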
The only remaining question I can then see is whether long-running submissions with DRM_XE_VM_CREATE_FLAG_FAULT_MODE could potentially block graphics submissions without this flag from accessing the hardware?

Yes and no. We have a mechanism in place that allows either only fault-mode jobs or only non-faulting jobs on the same, what we call, "engine group". A pagefault on an engine group would block or hamper progress of other jobs on that engine group.

So let's say a dma-fence job is submitted to an engine group that is currently running a faulting job. We'd then need to switch the mode of the engine group, and in the exec ioctl we'd (explicitly, without preempt-fences) preempt the faulting job before submitting the dma-fence job and publishing its fence. This preemption incurs a delay, typically the delay of servicing any outstanding pagefaults. It's not ideal, but the best we can do, and it affects neither core memory management nor the migration blits.

In the debugger case, this delay could be long due to breakpoints, and that's why enabling the debugger would sit behind a flag and not be the default (I think this was discussed earlier in the thread). Still, core memory management would be unaffected, and of course the migration blits are completely independent.
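The engine-group policy described above can be sketched as a small state machine. This is a hedged toy model only — the struct, names, and transitions are illustrative and not the actual xe scheduler code:

```c
/* Toy model of the "engine group" policy: a group runs either
 * fault-mode jobs or dma-fence jobs, never both at once, and
 * switching to dma-fence mode first preempts the faulting jobs
 * (incurring the pagefault-servicing delay).  Illustrative only. */
#include <assert.h>

enum group_mode { GROUP_IDLE, GROUP_FAULT_MODE, GROUP_DMA_FENCE_MODE };

struct engine_group {
    enum group_mode mode;
    int running_fault_jobs;
    int preempt_count;      /* how often we preempted for a mode switch */
};

/* Stand-in for preempting faulting jobs without preempt-fences; in
 * the real driver this is where outstanding pagefaults get serviced. */
static void group_preempt_fault_jobs(struct engine_group *g)
{
    g->running_fault_jobs = 0;
    g->preempt_count++;
}

static void submit_fault_job(struct engine_group *g)
{
    /* Fault jobs only run while the group is not in dma-fence mode. */
    assert(g->mode != GROUP_DMA_FENCE_MODE);
    g->mode = GROUP_FAULT_MODE;
    g->running_fault_jobs++;
}

/* exec-ioctl path for a dma-fence job: the mode switch (and hence the
 * preemption delay) happens before the job's fence is published. */
static void submit_dma_fence_job(struct engine_group *g)
{
    if (g->mode == GROUP_FAULT_MODE)
        group_preempt_fault_jobs(g);
    g->mode = GROUP_DMA_FENCE_MODE;
    /* ... publish fence and run the job ... */
}
```

The point the model captures is that the delay is paid in the submitting path, before fence publication, so no published dma-fence ever depends on a faulting (or breakpointed) job.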
Yeah, that sounds totally sane to me.
Sorry for the noise then. I didn't realize that you have two separate modes of operation.
Going to reply on the other open questions separately.
Regards,
Christian.
/Thomas

Thanks a lot for pointing this out, Christian.

Thanks,
Thomas

Regards,
Christian.

Nor will there be any jobs with published dma-fences depending on the job blocked either temporarily by a pagefault or long-term by a debugger breakpoint.

/Thomas

If that is done and the memory preempt-fence is serviced even for debuggable contexts, do you have further concerns with the presented approach from a dma-buf and drm/sched perspective?

Regards,
Joonas

Regards,
Christian.

This means that a breakpoint or core dump doesn't halt GPU threads, but rather suspends them. E.g. all running wave data is collected into a state bag which can be restored later on. I was under the impression that those long-running compute threads do exactly that, but when the hardware can't switch out the GPU thread/process while in a break, then that isn't the case. As long as you don't find a way to avoid that, this patch set is a pretty clear NAK from my side as DMA-buf and TTM maintainer.

I believe this is addressed above.

Matt

What might work is to keep the submission on the hardware in the break state but forbid any memory access. This way you can signal your preemption fence even when the hardware isn't made available. Before you continue, xe sets up a new preemption fence and makes sure that all page tables etc. are up to date. Could be tricky to get this right if completion-fence-based submissions are mixed in as well, but that gives you at least a direction you could potentially go.

Regards,
Christian.

Regards,
Joonas

Regards,
Christian.

Some wash-up thoughts from me below, but consider them fairly irrelevant since I think the main driver for these big questions here should be gdb/userspace.

Quoting Christian König (2024-11-07 11:44:33):

On 06.11.24 at 18:00, Matthew Brost wrote:

[SNIP]

This is not a generic interface that anyone can freely access. The same permissions used by ptrace are checked when opening such an interface. See [1] [2].
[1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
[2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2

Thanks a lot for those pointers, that is exactly what I was looking for. And yeah, it is what I feared. You are re-implementing existing functionality, but see below.

Could you elaborate on what this "existing functionality" exactly is? I do not think this functionality exists at this time.

The EU debugging architecture for xe specifically avoids the need for GDB to attach with ptrace to the CPU process or to interfere with the CPU process for debugging via parasitic threads or the like. A debugger connection is opened to the DRM driver for a given PID (which uses the ptrace may-access check for now), after which all DRM clients of that PID are exposed to the debugger process.

What we want to expose via that debugger connection is the ability for GDB to read/write the different GPU VM address spaces (ppGTT for Intel GPUs) just like the EU threads would see them. Note that the layout of the ppGTT is completely up to the userspace driver to set up and is mostly only partially equal to the CPU address space.

Specifically, as part of reading/writing the ppGTT for debugging purposes, deep flushes are needed: for example, flushing the instruction cache when adding/removing breakpoints. Maybe that explains the background. I elaborate on this some more at the end.

kmap/vmap are used everywhere in the DRM subsystem to access BOs, so I'm failing to see the problem with adding a simple helper based on existing code.

What's possible and often done is to do kmap/vmap if you need to implement a CPU copy for scanout, for example, or for copying/validating command buffers. But that usually requires accessing the whole BO and has separate security checks. When you want to access only a few bytes of a BO, that sounds massively like a peek/poke-like interface, and we have already rejected that more than once.
There even used to be standardized GEM IOCTLs for that which have been removed by now.

Referring to the explanation at the top: these IOCTLs are not for the debugging target process to issue. The peek/poke interface is specifically for GDB only, to facilitate the emulation of memory reads/writes on the GPU address space as if they were done by the EUs themselves. And to recap: for modifying instructions, for example (add/remove breakpoint), an extra level of cache flushing is needed which is not available to regular userspace.

I specifically discussed the difference with Sima before moving forward with this design originally. If something has changed since then, I'm of course happy to rediscuss. However, if this code can't be added, I'm not sure how we would ever be able to implement core dumps for GPU threads/memory.

If you need to access BOs which are placed in non-CPU-accessible memory, then implement the access callback for ptrace; see amdgpu_ttm_access_memory for an example of how to do this.

As also mentioned above, we don't work via ptrace at all when it comes to debugging the EUs. The only thing used for now is ptrace_may_access, to implement access restrictions similar to the ones ptrace has. This can be changed to something else if needed.

Ptrace access goes via vm_operations_struct.access → ttm_bo_vm_access. This series renames ttm_bo_vm_access to ttm_bo_access, with no code changes. That function accesses a BO via kmap if it is in SYSTEM / TT, which is existing code. This function is only exposed to user space via ptrace permissions. Maybe this sentence is what caused the confusion: userspace is never exposed to a peek/poke interface, only the debugger connection, which is its own FD.

In this series, we implement a function [3] similar to amdgpu_ttm_access_memory for the TTM access_memory vfunc. What is missing is non-CPU-visible memory access, similar to amdgpu_ttm_access_memory_sdma. This will be addressed in a follow-up and was omitted from this series given its complexity.
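For readers unfamiliar with the access-callback contract being discussed (vm_operations_struct.access / ttm_bo_vm_access style): it reads or writes a small number of bytes at an offset inside the object, clamped to the object's bounds, and returns the number of bytes copied. The following is a hedged toy sketch of that contract only — the struct and names are illustrative, not the TTM API:

```c
/* Toy sketch of an access-callback: copy `len` bytes at `offset`
 * inside the object, clamped to its size, returning the byte count.
 * In TTM the backing pointer would come from kmap/vmap of the BO;
 * here it is just a plain buffer.  Illustrative only. */
#include <stddef.h>
#include <string.h>

struct toy_bo {
    unsigned char *vmap;   /* stands in for a kmap/vmap of the BO */
    size_t size;
};

static long toy_bo_access(struct toy_bo *bo, size_t offset,
                          void *buf, size_t len, int write)
{
    if (offset >= bo->size)
        return 0;
    if (len > bo->size - offset)
        len = bo->size - offset;

    if (write)
        memcpy(bo->vmap + offset, buf, len);   /* "poke" */
    else
        memcpy(buf, bo->vmap + offset, len);   /* "peek" */
    return (long)len;
}
```

This shape is what makes the interface suitable for debugger-sized accesses (a breakpoint opcode, a few words of memory) rather than whole-BO copies.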
So this looks more or less identical to AMD's ptrace implementation, but in GPU address space. Again, I fail to see what the problem is here. What am I missing?

The main question is why can't you use the existing interfaces directly?

We're not working on the CPU address space or on BOs. We're working strictly on the GPU address space as it would be seen by an EU thread if it accessed address X.

Additional to the peek/poke interface of ptrace, Linux has the pidfd_getfd system call, see https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html. pidfd_getfd() allows you to dup() the render node file descriptor into your gdb process. That in turn gives you all the access you need from gdb, including mapping BOs and command submission on behalf of the application.

We're not operating on the CPU address space, nor are we operating on BOs (there is no concept of a BO in the EU debug interface). Each VMA in the VM could come from anywhere; only the start address and size matter. And neither do we need to interfere with the command submission of the process under debug.

As far as I can see, that allows for the same functionality as the eudebug interface, just without any driver-specific code messing with ptrace permissions and peek/poke interfaces. So the question is still: why do you need the whole eudebug interface in the first place? I might be missing something, but that seems superfluous from a high-level view.

Recapping from above: it is to allow the debugging of EU threads per DRM client, completely independent of the CPU process. If ptrace_may_access is the sore point, we could consider other permission checks, too. There is no other connection to ptrace in this architecture than the single permission check to know whether a PID is fair game for access by the debugger process.
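For context, the pidfd_getfd() flow Christian refers to looks roughly like this. A minimal sketch, assuming Linux >= 5.6 and raw syscall numbers (glibc wrappers for these calls are recent); for simplicity the demonstration targets the current process, which passes the same ptrace-style access check a debugger would need against a debuggee:

```c
/* Duplicate a file descriptor out of another process via
 * pidfd_open(2) + pidfd_getfd(2).  The kernel performs a
 * PTRACE_MODE_ATTACH_REALCREDS check, so this is subject to the
 * same permissions as ptrace attach. */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Syscall numbers are the same on all architectures for these. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

static int xpidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

static int xpidfd_getfd(int pidfd, int targetfd)
{
    return (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, 0);
}

/* Dup fd `targetfd` of process `pid` into this process.
 * Returns the new local fd, or -1 on error. */
static int steal_fd(pid_t pid, int targetfd)
{
    int pidfd = xpidfd_open(pid);
    int fd;

    if (pidfd < 0)
        return -1;
    fd = xpidfd_getfd(pidfd, targetfd);
    close(pidfd);
    return fd;
}
```

In the scenario Christian describes, `pid` would be the debuggee and `targetfd` its open render node, giving gdb a DRM fd pointing at the same drm_file.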
Why no parasitic thread or ptrace: going forward, binding the EU debugging to the DRM client would also pave the way for extending the core-kernel-generated core dump with each DRM client's EU thread/memory dump. We have a similar feature called "offline core dump" enabled in the downstream public trees for i915, where we currently attach the EU thread dump to the i915 error state and then later combine the i915 error state with the CPU core dump file using a tool.

This is a relatively small amount of extra code, as this baseline series already gives GDB the ability to perform the necessary actions. It's just a matter of the kernel driver calling "stop all threads" and then copying the memory map and memory contents for GPU threads, just like is done for CPU threads. With parasitic thread injection, I'm not sure there is such a way forward, as it would seem to require injecting quite a bit more logic into the core kernel?

It's true that the AMD KFD part still has similar functionality, but that is because of the broken KFD design of tying driver state to the CPU process (which makes it inaccessible to gdb even with an imported render node fd). Both Sima and I (and partially Dave as well) have pushed back on the KFD approach. And the long-term plan is to get rid of such device-driver-specific interfaces which re-implement existing functionality, just differently.

Recapping: this series is not adding that back. The debugger connection is a separate FD from the DRM one, with a separate IOCTL set. We don't allow the DRM FD any new operations based on whether ptrace is attached or not; we never even do that check. We only restrict the opening of the debugger connection to a given PID with the ptrace_may_access check for now. That can be changed to something else, if necessary.
Yeah, I think unnecessarily tying GPU processes to CPU processes is a bad thing, not least because even today all the SVM discussions we have still hit clear use cases where a 1:1 match is not wanted (like multiple GPU SVM sections with offsets). Not even speaking of all the GPU use cases where the GPU VM space is still entirely independent of the CPU side.

So that's why I think this entirely separate approach looks like the right one, with ptrace_may_access as the access-control check to make sure we match ptrace on the CPU side. But there's very obviously a bikeshed to be had on what the actual uAPI should look like, especially how gdb opens up a GPU debug access fd. But I also think that's not much for drm to decide, but whatever gdb wants. And then we aim for some consistency on that lookup/access-control part (ideally; I might be missing some reasons why this is a bad idea) across drm drivers.

So you need to have a really, really good explanation why the eudebug interface is actually necessary.

TL;DR: the main point is to decouple the debugging of the EU workloads from the debugging of the CPU process. This avoids interfering with the CPU process via parasitic thread injection. Further, this also allows generating a core dump without any GDB connected. There are also many other smaller pros/cons which can be discussed, but for the context of this patch, this is the main one.

So unlike parasitic thread injection, we don't unlock any special IOCTLs for the process under debug to be performed by the parasitic thread, but we allow a minimal set of operations to be performed by GDB as if they were done on the EUs themselves. One can think of it as the minimal subset of ptrace, but for EU threads rather than CPU threads. And thus, building on this, it's possible to extend the core-kernel-generated core dumps with a DRM-specific extension which would contain the EU thread/memory dump.
It might be good to document (in that debugging doc patch, probably) why thread injection is not a great option, and why the tradeoffs for debugging are different than for checkpoint/restore, where with CRIU we landed on doing most of this in userspace, often requiring injection threads to make it all work.

Cheers,
Sima

Regards,
Joonas

Regards,
Christian.

Matt

[3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6