Proposal for a new CS ioctl, kernel pseudocode:

lock(&global_lock);
serial = get_next_serial(dev);        /* monotonically increasing per-device serial */
add_wait_command(ring, serial - 1);   /* wait for the previous submission to finish */
add_exec_cmdbuf(ring, user_cmdbuf);   /* execute the user command buffer */
add_signal_command(ring, serial);     /* signal this submission's serial */
*ring->doorbell = FIRE;               /* ring the doorbell to start execution */
unlock(&global_lock);
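
From userspace this would then look like any other submit ioctl. A minimal sketch, assuming a placeholder uAPI (the struct, the ioctl number, and the returned sync_file fd are all made up here, just to illustrate the shape):

#include <stdint.h>
#include <sys/ioctl.h>

/* Placeholder uAPI, for illustration only - not an existing interface. */
struct new_cs_args {
        uint64_t cmdbuf_va;     /* GPU VA of the user command buffer */
        uint32_t cmdbuf_size;   /* size in bytes */
        int32_t  out_fence_fd;  /* returned sync_file fd for "serial" */
};
#define DRM_IOCTL_NEW_CS _IOWR('d', 0x40, struct new_cs_args) /* made-up number */

/* Submit a command buffer the process has already written in its own
 * memory; the kernel emits the wait/exec/signal packets and rings the
 * doorbell itself, so the process never needs to touch the ring. */
static int submit_cs(int drm_fd, uint64_t cmdbuf_va, uint32_t cmdbuf_size)
{
        struct new_cs_args args = {
                .cmdbuf_va   = cmdbuf_va,
                .cmdbuf_size = cmdbuf_size,
        };

        if (ioctl(drm_fd, DRM_IOCTL_NEW_CS, &args) != 0)
                return -1;

        return args.out_fence_fd; /* shareable for implicit/explicit sync */
}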
See? Just like userspace submit, but in the kernel without
concurrency/preemption. Is this now safe enough for dma_fence?
Marek
On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@xxxxxxxxx> wrote:
What about direct submit from the kernel where the process still
has write access to the GPU ring buffer but doesn't use it? I
think that solves your preemption example, but leaves a potential
backdoor for a process to overwrite the signal commands, which
shouldn't be a problem since we are OK with timeouts.
Marek
On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
> >
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I didn't fully understand before is that dma_buf has a very subtle second requirement beyond finite time completion: Nothing required for signaling a dma-fence can allocate memory. Why? Because the act of allocating memory may wait on your dma-fence. This, as it turns out, is a massively more strict requirement than finite time completion and, I think, throws out all of the proposals we have so far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with dma-fence by asking the kernel for a next serial and the kernel trusting userspace to signal it. That doesn't work at all if allocating memory to trigger a dma-fence can blow up. There's simply no way for the kernel to trust userspace to not do ANYTHING which might allocate memory. I don't even think there's a way userspace can trust itself there. It also blows up my plan of moving the fences to transition boundaries.
> >
> > Not sure where that leaves us.
>
> Honestly the more I look at things I think userspace-signalable fences with a timeout sound like they are a valid solution for these issues. Especially since (as has been mentioned countless times in this email thread) userspace already has a lot of ways to cause timeouts and/or GPU hangs through GPU work already.
>
> Adding a timeout on the signaling side of a dma_fence would ensure:
>
> - The dma_fence signals in finite time
> - If the timeout case does not allocate memory then memory allocation is not a blocker for signaling.
>
> Of course you lose the full dependency graph and we need to make sure garbage collection of fences works correctly when we have cycles. However, the latter sounds very doable and the first sounds like it is to some extent inevitable.
>
> I feel like I'm missing some requirement here given that we immediately went to much more complicated things but can't find it. Thoughts?
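>
> Roughly, such a signal-side timeout could look like this (sketch only, not existing code; the uf_* names are made up and refcounting/cleanup is omitted):
>
> #include <linux/dma-fence.h>
> #include <linux/timer.h>
>
> /* A userspace-signalable fence the kernel force-signals with an error
>  * if userspace takes too long. The timeout path allocates no memory. */
> struct uf_fence {
>         struct dma_fence base;
>         spinlock_t lock;
>         struct timer_list timeout;
> };
>
> static const char *uf_driver(struct dma_fence *f) { return "uapi"; }
> static const char *uf_timeline(struct dma_fence *f) { return "user"; }
>
> static const struct dma_fence_ops uf_ops = {
>         .get_driver_name   = uf_driver,
>         .get_timeline_name = uf_timeline,
> };
>
> static void uf_timeout_fired(struct timer_list *t)
> {
>         struct uf_fence *uf = from_timer(uf, t, timeout);
>
>         dma_fence_set_error(&uf->base, -ETIME);
>         dma_fence_signal(&uf->base);
> }
>
> /* Called by the ioctl that hands the fence out to userspace. */
> static void uf_arm(struct uf_fence *uf, u64 ctx, u64 seqno, unsigned long ms)
> {
>         spin_lock_init(&uf->lock);
>         dma_fence_init(&uf->base, &uf_ops, &uf->lock, ctx, seqno);
>         timer_setup(&uf->timeout, uf_timeout_fired, 0);
>         mod_timer(&uf->timeout, jiffies + msecs_to_jiffies(ms));
> }
>
> /* Called when userspace reports completion in time. */
> static void uf_user_signal(struct uf_fence *uf)
> {
>         /* Only signal if we beat the timer; otherwise it already
>          * signaled the fence with -ETIME. */
>         if (del_timer_sync(&uf->timeout))
>                 dma_fence_signal(&uf->base);
> }
>
> The point being that uf_timeout_fired() only touches memory that was already allocated when the fence was created, so memory allocation never blocks signaling.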
Timeouts are sufficient to protect the kernel but they make the fences unpredictable and unreliable from a userspace PoV. One of the big problems we face is that, once we expose a dma_fence to userspace, we've allowed for some pretty crazy potential dependencies that neither userspace nor the kernel can sort out. Say you have Marek's "next serial, please" proposal and a multi-threaded application. Between the time you ask the kernel for a serial and get a dma_fence and submit the work to signal that serial, your process may get preempted, something else shoved in which allocates memory, and then we end up blocking on that dma_fence. There's no way userspace can predict and defend itself from that.

So I think where that leaves us is that there is no safe place to create a dma_fence except for inside the ioctl which submits the work and only after any necessary memory has been allocated. That's a pretty stiff requirement. We may still be able to interact with userspace a bit more explicitly but I think it throws any notion of userspace direct submit out the window.
--Jason
> - Bas
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> > >
> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@xxxxxxxxxxx> wrote:
> > > >>
> > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > >> > Hi Dave,
> > > >> >
> > > >> > On 27.04.21 at 21:23, Marek Olšák wrote:
> > > >> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
> > > >> >>
> > > >> >> Marek
> > > >> >>
> > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@xxxxxxxxx> wrote:
> > > >> >>
> > > >> >> On Tue, 27 Apr 2021 at 22:06, Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> > > >> >> >
> > > >> >> > Correct, we wouldn't have synchronization between devices with and without user queues any more.
> > > >> >> >
> > > >> >> > That could only be a problem for A+I Laptops.
> > > >> >>
> > > >> >> Since I think you mentioned you'd only be enabling this on newer chipsets, won't it be a problem for A+A where one A is a generation behind the other?
> > > >> >>
> > > >> >
> > > >> > Crap, that is a good point as well.
> > > >> >
> > > >> >>
> > > >> >> I'm not really liking where this is going btw; it seems like an ill-thought-out concept. If AMD is really going down the road of designing hw that is currently Linux incompatible, you are going to have to accept a big part of the burden in bringing this support into more than just amd drivers for upcoming generations of gpu.
> > > >> >>
> > > >> >
> > > >> > Well we don't really like that either, but we have no other option as far as I can see.
> > > >>
> > > >> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
> > > >>
> > > >> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multiplexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
> > > >>
> > > >> I'm probably missing something though, awaiting enlightenment. :)
> > > >
> > > >
> > > > The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
> > > >
> > >
> > > The doorbell does not have to be mapped into the process's GPU virtual address space. The CPU could write to it directly. Mapping it into the GPU's virtual address space would allow you to have a device kick off work however rather than the CPU. E.g., the GPU could kick off its own work or multiple devices could kick off work without CPU involvement.
> > >
> > > Alex
> > >
> > >
> > > > The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
> > > >
> > > > Marek