Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Tue, 4 May 2021 11:14:06 +0200

Am 04.05.21 um 10:27 schrieb Daniel Vetter:
On Tue, May 4, 2021 at 10:09 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 04.05.21 um 09:32 schrieb Daniel Vetter:
On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
Unfortunately as I pointed out to Daniel as well this won't work 100%
reliable either.
You're claiming this, but there's no clear reason why really, and you
did't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.
Yeah, it's rather hard to explain without pointing out how the hardware
works in detail.

See the signal on the ring buffer needs to be protected by manipulation from
userspace so that we can guarantee that the hardware really has finished
executing when it fires.
Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarnatee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.

Protecting memory by immediate page table updates is a good first step, but
unfortunately not sufficient (and we would need to restructure large parts
of the driver to make this happen).
This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.
And exactly that's the problem! We can't provide a reliable unload-fence
and the user fences are unreliable for that.

I've talked this through lengthy with our hardware/firmware guy last
Thursday but couldn't find a solution either.

We can have a preemption fence for the kernel which says: Hey this queue
was scheduled away you can touch it's hardware descriptor, control
registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
But that one is only triggered on preemption and then we have the same
ordering problems once more.

Or we can have a end of operation fence for userspace which says: Hey
this queue has finished it's batch of execution, but this one is
manipulable from userspace in both finish to early (very very bad for
invalidations and memory management) or finish to late/never (deadlock
prone but fixable by timeout).

What we could do is to use the preemption fence to emulate the unload
fence, e.g. something like:
1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted check if we have reached the checkpoint in question
by looking at the hardware descriptor.
3. If we have reached the checkpoint signal the unload fence.
4. If we haven't reached the checkpoint resume the queue again.

The problem is that this might introduce a maximum of 100ms delay before
signaling the unload fence and preempt/resume has such a hefty overhead
that we waste a horrible amount of time on it.
So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)

Yeah, it just takes to long for the preemption to complete to be really 
useful for the feature we are discussing here.

As I said when the kernel requests to preempt a queue we can easily 
expect a timeout of ~100ms until that comes back. For compute that is 
even in the multiple seconds range.

The "preemption" feature is really called suspend and made just for the 
case when we want to put a process to sleep or need to forcefully kill 
it for misbehavior or stuff like that. It is not meant to be used in 
normal operation.

If we only attach it on ->move then yeah maybe a last resort possibility 
to do it this way, but I think in that case we could rather stick with 
kernel submissions.

Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and 
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start 
rendering the next one.

Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of course if you issue the preempt context request before the
userspace fences have finished (or tdr cleaned up the mess) like you
do in your proposal, then it will be ridiculously expensive and/or
wont work. So just don't do that.

btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fence that rely on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.

On older hardware we often had the situation that for reliable invalidation
we need the guarantee that every previous operation has finished executing.
It's not so much of a problem when the next operation has already started,
since then we had the opportunity to do things in between the last and the
next operation. Just see cache invalidation and VM switching for example.
If you have gpu page faults you generally have synchronous tlb
invalidation,
Please tell that our hardware engineers :)

We have two modes of operation, see the whole XNACK on/off discussion on
the amdgfx mailing list.
I didn't find this anywhere with a quick search. Pointers to archive
(lore.kernel.org/amd-gfx is the best imo).

Can't find that of hand either, but see the amdgpu_noretry module option.

It basically tells the hardware if retry page faults should be supported 
or not because this whole TLB shutdown thing when they are supported is 
extremely costly.

so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.

Additional to that it doesn't really buy us anything, e.g. there is not much
advantage to this. Writing the ring buffer in userspace and then ringing in
the kernel has the same overhead as doing everything in the kernel in the
first place.
It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl wont work anymore for sure?
We still have a bit more time for this. As I learned from our firmware
engineer last Thursday the Windows side is running into similar problems
as we do.
This story sounds familiar, I've heard it a few times here at intel
too on various things where we complained and then windows hit the
same issues too :-)

E.g. I've just learned that all the things we've discussed around gpu
page faults vs 3d workloads and how you need to reserve some CU for 3d
guaranteed forward progress or even worse measures is also something
they're hitting on Windows. Apparently they fixed it by only running
3d or compute workloads at the same time, but not both.

I'm not even sure if we are going to see user fences on Windows with the 
next hw generation.

Before we can continue with this discussion we need to figure out how to 
get the hardware reliable first.

In other words if we would have explicit user fences everywhere, how 
would we handle timeouts and misbehaving processes? As it turned out 
they haven't figured this out on Windows yet either.

Maybe also pick up that other subthread which ended with my last reply.
I will send out another proposal for how to handle user fences shortly.
Maybe let's discuss this here first before we commit to requiring all
userspace to upgrade to user fences ... I do agree that we want to go
there too, but breaking all the compositors is probably not the best
option.

I was more thinking about handling it all in the kernel.

Christian.

Cheers, Daniel

Cheers,
Christian.

Cheers, Daniel

Christian.

Am 04.05.21 um 05:11 schrieb Marek Olšák:
Proposal for a new CS ioctl, kernel pseudo code:

lock(&global_lock);
serial = get_next_serial(dev);
add_wait_command(ring, serial - 1);
add_exec_cmdbuf(ring, user_cmdbuf);
add_signal_command(ring, serial);
*ring->doorbell = FIRE;
unlock(&global_lock);

See? Just like userspace submit, but in the kernel without
concurrency/preemption. Is this now safe enough for dma_fence?

Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@xxxxxxxxx
<mailto:maraeo@xxxxxxxxx>> wrote:

      What about direct submit from the kernel where the process still
      has write access to the GPU ring buffer but doesn't use it? I
      think that solves your preemption example, but leaves a potential
      backdoor for a process to overwrite the signal commands, which
      shouldn't be a problem since we are OK with timeouts.

      Marek

      On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
      <jason@xxxxxxxxxxxxxx <mailto:jason@xxxxxxxxxxxxxx>> wrote:

          On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
          <bas@xxxxxxxxxxxxxxxxxxx <mailto:bas@xxxxxxxxxxxxxxxxxxx>> wrote:
          >
          > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
          <jason@xxxxxxxxxxxxxx <mailto:jason@xxxxxxxxxxxxxx>> wrote:
          > >
          > > Sorry for the top-post but there's no good thing to reply
          to here...
          > >
          > > One of the things pointed out to me recently by Daniel
          Vetter that I
          > > didn't fully understand before is that dma_buf has a very
          subtle
          > > second requirement beyond finite time completion:  Nothing
          required
          > > for signaling a dma-fence can allocate memory. Why?
          Because the act
          > > of allocating memory may wait on your dma-fence.  This, as
          it turns
          > > out, is a massively more strict requirement than finite time
          > > completion and, I think, throws out all of the proposals
          we have so
          > > far.
          > >
          > > Take, for instance, Marek's proposal for userspace
          involvement with
          > > dma-fence by asking the kernel for a next serial and the
          kernel
          > > trusting userspace to signal it.  That doesn't work at all if
          > > allocating memory to trigger a dma-fence can blow up.
          There's simply
          > > no way for the kernel to trust userspace to not do
          ANYTHING which
          > > might allocate memory.  I don't even think there's a way
          userspace can
          > > trust itself there.  It also blows up my plan of moving
          the fences to
          > > transition boundaries.
          > >
          > > Not sure where that leaves us.
          >
          > Honestly the more I look at things I think
          userspace-signalable fences
          > with a timeout sound like they are a valid solution for
          these issues.
          > Especially since (as has been mentioned countless times in
          this email
          > thread) userspace already has a lot of ways to cause
          timeouts and or
          > GPU hangs through GPU work already.
          >
          > Adding a timeout on the signaling side of a dma_fence would
          ensure:
          >
          > - The dma_fence signals in finite time
          > -  If the timeout case does not allocate memory then memory
          allocation
          > is not a blocker for signaling.
          >
          > Of course you lose the full dependency graph and we need to
          make sure
          > garbage collection of fences works correctly when we have
          cycles.
          > However, the latter sounds very doable and the first sounds
          like it is
          > to some extent inevitable.
          >
          > I feel like I'm missing some requirement here given that we
          > immediately went to much more complicated things but can't
          find it.
          > Thoughts?

          Timeouts are sufficient to protect the kernel but they make
          the fences
          unpredictable and unreliable from a userspace PoV.  One of the big
          problems we face is that, once we expose a dma_fence to userspace,
          we've allowed for some pretty crazy potential dependencies that
          neither userspace nor the kernel can sort out.  Say you have
          marek's
          "next serial, please" proposal and a multi-threaded application.
          Between time time you ask the kernel for a serial and get a
          dma_fence
          and submit the work to signal that serial, your process may get
          preempted, something else shoved in which allocates memory,
          and then
          we end up blocking on that dma_fence.  There's no way
          userspace can
          predict and defend itself from that.

          So I think where that leaves us is that there is no safe place to
          create a dma_fence except for inside the ioctl which submits
          the work
          and only after any necessary memory has been allocated. That's a
          pretty stiff requirement.  We may still be able to interact with
          userspace a bit more explicitly but I think it throws any
          notion of
          userspace direct submit out the window.

          --Jason

          > - Bas
          > >
          > > --Jason
          > >
          > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher
          <alexdeucher@xxxxxxxxx <mailto:alexdeucher@xxxxxxxxx>> wrote:
          > > >
          > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák
          <maraeo@xxxxxxxxx <mailto:maraeo@xxxxxxxxx>> wrote:
          > > > >
          > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer
          <michel@xxxxxxxxxxx <mailto:michel@xxxxxxxxxxx>> wrote:
          > > > >>
          > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
          > > > >> > Hi Dave,
          > > > >> >
          > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
          > > > >> >> Supporting interop with any device is always
          possible. It depends on which drivers we need to interoperate
          with and update them. We've already found the path forward for
          amdgpu. We just need to find out how many other drivers need
          to be updated and evaluate the cost/benefit aspect.
          > > > >> >>
          > > > >> >> Marek
          > > > >> >>
          > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie
          <airlied@xxxxxxxxx <mailto:airlied@xxxxxxxxx>
          <mailto:airlied@xxxxxxxxx <mailto:airlied@xxxxxxxxx>>> wrote:
          > > > >> >>
          > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
          > > > >> >>     <ckoenig.leichtzumerken@xxxxxxxxx
          <mailto:ckoenig.leichtzumerken@xxxxxxxxx>
          <mailto:ckoenig.leichtzumerken@xxxxxxxxx
          <mailto:ckoenig.leichtzumerken@xxxxxxxxx>>> wrote:
          > > > >> >>     >
          > > > >> >>     > Correct, we wouldn't have synchronization
          between device with and without user queues any more.
          > > > >> >>     >
          > > > >> >>     > That could only be a problem for A+I Laptops.
          > > > >> >>
          > > > >> >>     Since I think you mentioned you'd only be
          enabling this on newer
          > > > >> >>     chipsets, won't it be a problem for A+A where
          one A is a generation
          > > > >> >>     behind the other?
          > > > >> >>
          > > > >> >
          > > > >> > Crap, that is a good point as well.
          > > > >> >
          > > > >> >>
          > > > >> >>     I'm not really liking where this is going btw,
          seems like a ill
          > > > >> >>     thought out concept, if AMD is really going
          down the road of designing
          > > > >> >>     hw that is currently Linux incompatible, you
          are going to have to
          > > > >> >>     accept a big part of the burden in bringing
          this support in to more
          > > > >> >>     than just amd drivers for upcoming generations
          of gpu.
          > > > >> >>
          > > > >> >
          > > > >> > Well we don't really like that either, but we have
          no other option as far as I can see.
          > > > >>
          > > > >> I don't really understand what "future hw may remove
          support for kernel queues" means exactly. While the
          per-context queues can be mapped to userspace directly, they
          don't *have* to be, do they? I.e. the kernel driver should be
          able to either intercept userspace access to the queues, or in
          the worst case do it all itself, and provide the existing
          synchronization semantics as needed?
          > > > >>
          > > > >> Surely there are resource limits for the per-context
          queues, so the kernel driver needs to do some kind of
          virtualization / multi-plexing anyway, or we'll get sad user
          faces when there's no queue available for <current hot game>.
          > > > >>
          > > > >> I'm probably missing something though, awaiting
          enlightenment. :)
          > > > >
          > > > >
          > > > > The hw interface for userspace is that the ring buffer
          is mapped to the process address space alongside a doorbell
          aperture (4K page) that isn't real memory, but when the CPU
          writes into it, it tells the hw scheduler that there are new
          GPU commands in the ring buffer. Userspace inserts all the
          wait, draw, and signal commands into the ring buffer and then
          "rings" the doorbell. It's my understanding that the ring
          buffer and the doorbell are always mapped in the same GPU
          address space as the process, which makes it very difficult to
          emulate the current protected ring buffers in the kernel. The
          VMID of the ring buffer is also not changeable.
          > > > >
          > > >
          > > > The doorbell does not have to be mapped into the
          process's GPU virtual
          > > > address space.  The CPU could write to it directly.
          Mapping it into
          > > > the GPU's virtual address space would allow you to have
          a device kick
          > > > off work however rather than the CPU. E.g., the GPU
          could kick off
          > > > it's own work or multiple devices could kick off work
          without CPU
          > > > involvement.
          > > >
          > > > Alex
          > > >
          > > >
          > > > > The hw scheduler doesn't do any synchronization and it
          doesn't see any dependencies. It only chooses which queue to
          execute, so it's really just a simple queue manager handling
          the virtualization aspect and not much else.
          > > > >
          > > > > Marek
          > > > > _______________________________________________
          > > > > dri-devel mailing list
          > > > > dri-devel@xxxxxxxxxxxxxxxxxxxxx
          <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>
          > > > >
          https://lists.freedesktop.org/mailman/listinfo/dri-devel
          <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
          > > > _______________________________________________
          > > > mesa-dev mailing list
          > > > mesa-dev@xxxxxxxxxxxxxxxxxxxxx
          <mailto:mesa-dev@xxxxxxxxxxxxxxxxxxxxx>
          > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
          <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
          > > _______________________________________________
          > > dri-devel mailing list
          > > dri-devel@xxxxxxxxxxxxxxxxxxxxx
          <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>
          > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
          <https://lists.freedesktop.org/mailman/listinfo/dri-devel>

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel