Am 31.07.19 um 12:39 schrieb Daniel Vetter:
On Wed, Jul 31, 2019 at 11:44 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 31.07.19 um 11:12 schrieb Daniel Vetter:
[SNIP]
I think I brought this up before, but new top-post for a clean start.
Use-case I have in mind is something like amdkfd's model, where you have a
list of buffers (per context or whatever) that you always need to have
present. Idea is to also use this for traditional CS for vk/gl, to cut
down on the buffer management overhead, but we'd still allow additional
buffers to be listed per-CS on top of that default working set.
This of course means no implicit sync anymore on these default buffers
(the point is to avoid touching every buffer on every CS, updating fences
would defeat that). That's why the CS can still list additional buffers,
the only reason for that is to add implicit sync fences. Those buffers
would be most likely in the default working set already.
Consequence is that I want the amdkfd model of "evict when needed, but
keep resident by default", but also working implicit fences. And it must
be doable without touching every bo on every CS. Listing possible
implementation options:
- the amdkfd trick doesn't work because it would break implicit fencing -
any implicit sync would always result in the context getting
preempted/evicted, which isn't great.
I'm actually working on re-working implicit fencing towards better
supporting this.
Basic idea is that you split up the fences in a reservation object into
three categories:
1. Implicit sync on write.
2. Implicit sync on read.
3. No implicit sync at all.
Not really sure what you want to do here ... implicit sync is opt-in
(or opt-out flag if you need to keep CS backwards compat) per BO/CS.
At least when we discussed this forever at some XDCs consensus was
that storing the implicit sync mode on the BO is not going to work.
Well that's exactly what we are doing for amdgpu and this is working
perfectly. See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.
This should not only help the KFD, but also with amdgpu command
submission and things like page tables updates.
E.g. we need to fences for page tables updates around in reservation
objects as well, but you really really really don't want any implicit
synchronization with them :)
Why do you even try to do implicit sync with your pagetables? How can
your pagetables even get anywhere near where implicit sync matters?
When you unmap a BO from a page table the BO needs to stay in the same
place until the unmap operation is completed.
Otherwise you open up a short window where a process could access memory
which doesn't belongs to the process.
This unmap operation is usually an SDMA operation and nobody except for
the memory management needs to take this into account.
- we share the resv_obj between all the buffers in the default working set
of a context, with unsharing/resharing the resv_obj if they enter/leave
the default working set. That way there's only one resv_obj to update on
each CS, and we can attach a new shared fence for every CS. Trouble is
that this means a given buffer can only be part of one default working
set, so all shared buffers would need to be listed again separately. Not
so great if userspace has to deal with that fairly arbitrary limitation.
Yeah, that is exactly what we do with the per VM BOs in amdgpu.
The limitation that you have only one working set actually turned out to
be not a limitation at all, but rather seen as something welcomed by our
Vulkan guys.
We have per-ctx vms in i915, but I guess even for those sharing will be limited.
In amdgpu we had this funky stuff with bo lists which should represent
the resources used for a command submission.
But after actually talking to the Vulkan and other userspace guys we
completely deprecated that.
We settled on having per process resources which are always valid and a
dynamic list of resources you send to the kernel with each command
submission.
I also don't really see a way to have an implementation with good
performance where BOs can be in multiple working sets at the same time.
- we allow the ->move_notify callback to add new fences, which the
exporter needs to wait on before it schedules the pipelined move. This
also avoids the per-BO update on every CS, and it would allow buffers to
be shared and to be in multiple default working sets. The downside is
that ->move_notify needs to be able to cope with added fences, which
means we might need to grow the shared fences array, which might fail
with ENOMEM. Not great. We could fix this with some kind of permanent
shared fence slot reservations (i.e. a reserved slot which outlives
holding the reservation lock), but that might waste quite a bit of
memory. Probably not real problem in the grand scheme of things. I think
the fence itself can be preallocated per context, so that isn't the
problem.
Well the ENOMEM problem is the least of the problems with this approach.
You can still block for the fence which you wanted to add.
The real problem is that you can't tell if a BO is busy or not by just
looking at its current fences.
In other words when you stop adding fences you also want to stop moving
them individually on the LRU.
Well yeah, otherwise you're back a per BO overhead on CS. That's kinda
obvious. But also not sure why this matters, since atm we don't have
any proposed means to pass LRU updates through dma-buf. So exporter
(even with dynamic dma-buf) has to pessimistically assume already that
the exported BO is more busy than what the LRU position suggests, and
evict them later on. If we also need to funnel LRU updates over
dma-buf, then that's another beast entirely (and probably not what we
want to do at all).
Yeah, that's a really good point for DMA-buf.
But I'm thinking a bit wider here. Essentially we need some ideas which
not only work for DMA-buf, but also inside the same driver.
Also if you don't want to stall, then just reject after ->move_notify
wherethere the bo is still idle, and give up if so. Or we can add a
counter to indicate a bo needs to be considered busy.
When the memory management evicts one BO you essentially kick out a
whole process/working set.
So when you want to kick out the next BO you actually want to do this
for BOs which now became available anyway.
That approach won't work with the move_notify callback.
So essentially we don't just want move_notify, but also full LRU information?
At least for DMA-buf I was trying to avoid that. But in the long term it
might be necessary, yes.
- same as above, but the new fence doesn't get added, but returned to the
caller, and the exporter deals with the ENOMEM mess. Might not work
since an importer could have a lot of contexts using a given object, and
so would have a lot of fences to add.
I don't think that this will work.
See you not only need to be able to add the fence to the BO currently
evicted, but also to all other BO in your process/working set.
Additional to that moving the ENOMEM handling from the importer to the
exporter sounds as helpful as adding another layer of abstraction :)
You can go and pick a different victim to evict, or just give up
(there's not much you can do with ENOMEM). Instead of data corruption
because you're not waiting for a fence you should be waiting on.
Ok, but you can as well return -ENOMEM from the move_notify callback.
What I wanted to say is that by returning the error code you are still
more flexible than returning the fence. On other words returning the
fence doesn't seem to help...
Christian.
-Daniel
Regards,
Christian.
- something entirely different?
Thoughts?
Cheers, Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx