Re: [PATCH 1/6] dma-buf: add dynamic DMA-buf handling v13

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Wed, 31 Jul 2019 13:19:37 +0200

Am 31.07.19 um 12:39 schrieb Daniel Vetter:
On Wed, Jul 31, 2019 at 11:44 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 31.07.19 um 11:12 schrieb Daniel Vetter:
[SNIP]
I think I brought this up before, but new top-post for a clean start.

Use-case I have in mind is something like amdkfd's model, where you have a
list of buffers (per context or whatever) that you always need to have
present. Idea is to also use this for traditional CS for vk/gl, to cut
down on the buffer management overhead, but we'd still allow additional
buffers to be listed per-CS on top of that default working set.

This of course means no implicit sync anymore on these default buffers
(the point is to avoid touching every buffer on every CS, updating fences
would defeat that). That's why the CS can still list additional buffers,
the only reason for that is to add implicit sync fences. Those buffers
would be most likely in the default working set already.

Consequence is that I want the amdkfd model of "evict when needed, but
keep resident by default", but also working implicit fences. And it must
be doable without touching every bo on every CS. Listing possible
implementation options:

- the amdkfd trick doesn't work because it would break implicit fencing -
    any implicit sync would always result in the context getting
    preempted/evicted, which isn't great.
I'm actually working on re-working implicit fencing towards better
supporting this.

Basic idea is that you split up the fences in a reservation object into
three categories:
1. Implicit sync on write.
2. Implicit sync on read.
3. No implicit sync at all.
Not really sure what you want to do here ... implicit sync is opt-in
(or opt-out flag if you need to keep CS backwards compat) per BO/CS.
At least when we discussed this forever at some XDCs consensus was
that storing the implicit sync mode on the BO is not going to work.

Well that's exactly what we are doing for amdgpu and this is working 
perfectly. See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.

This should not only help the KFD, but also with amdgpu command
submission and things like page tables updates.

E.g. we need to fences for page tables updates around in reservation
objects as well, but you really really really don't want any implicit
synchronization with them :)
Why do you even try to do implicit sync with your pagetables? How can
your pagetables even get anywhere near where implicit sync matters?

When you unmap a BO from a page table the BO needs to stay in the same 
place until the unmap operation is completed.

Otherwise you open up a short window where a process could access memory 
which doesn't belongs to the process.

This unmap operation is usually an SDMA operation and nobody except for 
the memory management needs to take this into account.

- we share the resv_obj between all the buffers in the default working set
    of a context, with unsharing/resharing the resv_obj if they enter/leave
    the default working set. That way there's only one resv_obj to update on
    each CS, and we can attach a new shared fence for every CS. Trouble is
    that this means a given buffer can only be part of one default working
    set, so all shared buffers would need to be listed again separately. Not
    so great if userspace has to deal with that fairly arbitrary limitation.
Yeah, that is exactly what we do with the per VM BOs in amdgpu.

The limitation that you have only one working set actually turned out to
be not a limitation at all, but rather seen as something welcomed by our
Vulkan guys.
We have per-ctx vms in i915, but I guess even for those sharing will be limited.

In amdgpu we had this funky stuff with bo lists which should represent 
the resources used for a command submission.

But after actually talking to the Vulkan and other userspace guys we 
completely deprecated that.

We settled on having per process resources which are always valid and a 
dynamic list of resources you send to the kernel with each command 
submission.

I also don't really see a way to have an implementation with good
performance where BOs can be in multiple working sets at the same time.

- we allow the ->move_notify callback to add new fences, which the
    exporter needs to wait on before it schedules the pipelined move. This
    also avoids the per-BO update on every CS, and it would allow buffers to
    be shared and to be in multiple default working sets. The downside is
    that ->move_notify needs to be able to cope with added fences, which
    means we might need to grow the shared fences array, which might fail
    with ENOMEM. Not great. We could fix this with some kind of permanent
    shared fence slot reservations (i.e. a reserved slot which outlives
    holding the reservation lock), but that might waste quite a bit of
    memory. Probably not real problem in the grand scheme of things. I think
    the fence itself can be preallocated per context, so that isn't the
    problem.
Well the ENOMEM problem is the least of the problems with this approach.
You can still block for the fence which you wanted to add.

The real problem is that you can't tell if a BO is busy or not by just
looking at its current fences.

In other words when you stop adding fences you also want to stop moving
them individually on the LRU.
Well yeah, otherwise you're back a per BO overhead on CS. That's kinda
obvious. But also not sure why this matters, since atm we don't have
any proposed means to pass LRU updates through dma-buf. So exporter
(even with dynamic dma-buf) has to pessimistically assume already that
the exported BO is more busy than what the LRU position suggests, and
evict them later on. If we also need to funnel LRU updates over
dma-buf, then that's another beast entirely (and probably not what we
want to do at all).

Yeah, that's a really good point for DMA-buf.

But I'm thinking a bit wider here. Essentially we need some ideas which 
not only work for DMA-buf, but also inside the same driver.

Also if you don't want to stall, then just reject after ->move_notify
wherethere the bo is still idle, and give up if so. Or we can add a
counter to indicate a bo needs to be considered busy.

When the memory management evicts one BO you essentially kick out a
whole process/working set.

So when you want to kick out the next BO you actually want to do this
for BOs which now became available anyway.

That approach won't work with the move_notify callback.
So essentially we don't just want move_notify, but also full LRU information?

At least for DMA-buf I was trying to avoid that. But in the long term it 
might be necessary, yes.

- same as above, but the new fence doesn't get added, but returned to the
    caller, and the exporter deals with the ENOMEM mess. Might not work
    since an importer could have a lot of contexts using a given object, and
    so would have a lot of fences to add.
I don't think that this will work.

See you not only need to be able to add the fence to the BO currently
evicted, but also to all other BO in your process/working set.

Additional to that moving the ENOMEM handling from the importer to the
exporter sounds as helpful as adding another layer of abstraction :)
You can go and pick a different victim to evict, or just give up
(there's not much you can do with ENOMEM). Instead of data corruption
because you're not waiting for a fence you should be waiting on.

Ok, but you can as well return -ENOMEM from the move_notify callback.

What I wanted to say is that by returning the error code you are still 
more flexible than returning the fence. On other words returning the 
fence doesn't seem to help...

Christian.

-Daniel

Regards,
Christian.

- something entirely different?

Thoughts?

Cheers, Daniel

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel