On 17.02.20 at 20:38, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 7:58 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
On 17.02.20 at 18:55, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
[SNIP]
And is also the sole reason why I started looking into the ww_mutex cursor handling a while back (e.g. the initial version with the horrible macro hack). But this is really really hard to get right. So my thinking for now is to push this series upstream to at least unblock my ongoing P2P work.
Hm, but at least the move_notify stuff and the locking nightmare around that feels rushed if we just push that. Otoh it's indeed getting painful, and we'll probably have another few rounds of headaches to sort this all out. What about a
config EXPERIMENTAL_DYNAMIC_DMA_BUF
	depends on BROKEN
wrapped around the new ->move_notify hook, plus all relevant code?
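A minimal userspace sketch of the gating idea above, not the actual patch: the hook member and every call to it exist only when the option is defined. The option name comes from the mail (Kconfig would expose it to C with a CONFIG_ prefix); the struct and function names (attach_ops, notify_importer) are hypothetical stand-ins.

```c
#include <stddef.h>

/* Build with -DCONFIG_EXPERIMENTAL_DYNAMIC_DMA_BUF to compile the hook in.
 * In a kernel build, Kconfig would generate this macro. */

struct attachment;

/* Hypothetical stand-in for an importer's attach ops. */
struct attach_ops {
#ifdef CONFIG_EXPERIMENTAL_DYNAMIC_DMA_BUF
	void (*move_notify)(struct attachment *attach);
#endif
	int unused; /* keeps the struct non-empty when the hook is compiled out */
};

struct attachment {
	const struct attach_ops *ops;
	int notified; /* how often the importer was told about a move */
};

/* Returns 1 if the importer was notified, 0 if no hook is available. */
static int notify_importer(struct attachment *attach)
{
#ifdef CONFIG_EXPERIMENTAL_DYNAMIC_DMA_BUF
	if (attach->ops && attach->ops->move_notify) {
		attach->ops->move_notify(attach);
		return 1;
	}
#endif
	(void)attach;
	return 0;
}
```

With the option left undefined, exporters never see a hook to call, so the locking around it cannot be reached by accident.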
Oh, that is a really good idea and trivial to do.
My initial thinking was to make all of this part of the core ww_mutex implementation, but then I quickly found that this won't work.
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
The next idea was to have it based on dma_resv objects, but as you also figured out you then need to drop the reference to the contended lock somehow... So my current working plan was to use a GEM object to avoid the callback...
I've heard noise that someone is looking into adding dynamic dma-buf support to stuff like rdma drivers. Because interconnects and big machines. Plus it feels a bit awkward to mandate a gem library if you want to use dynamic dma-buf support in your driver. Hence why I think something around dma_resv (but with enough flexibility that it doesn't insist that the contending lock must be a dma_resv itself).
Ok, good to know. So to hell with the idea of using a GEM object.
But this also means that we can't do this with a single drop_ref() callback in the context, because the context might contain different objects of all kinds...
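One way to picture that constraint, as a userspace sketch only: each entry tracked by the context carries its own release callback, so objects of unrelated types can sit in the same context. All names here (exec_ctx, ctx_entry, gem_like, rdma_like) are hypothetical and not taken from the series.

```c
#include <stddef.h>

#define CTX_MAX 8

/* One tracked entry: the object plus the way to drop *its* reference. */
struct ctx_entry {
	void *obj;
	void (*drop_ref)(void *obj);
};

/* Hypothetical execution context tracking everything locked so far. */
struct exec_ctx {
	struct ctx_entry entries[CTX_MAX];
	size_t count;
};

/* Remember an object together with its type-specific release function. */
static int exec_ctx_add(struct exec_ctx *ctx, void *obj,
			void (*drop_ref)(void *obj))
{
	if (ctx->count == CTX_MAX)
		return -1;
	ctx->entries[ctx->count].obj = obj;
	ctx->entries[ctx->count].drop_ref = drop_ref;
	ctx->count++;
	return 0;
}

/* Release in reverse order, each object through its own callback. */
static void exec_ctx_fini(struct exec_ctx *ctx)
{
	while (ctx->count > 0) {
		struct ctx_entry *e = &ctx->entries[--ctx->count];

		e->drop_ref(e->obj);
	}
}

/* Two unrelated object types, as a GEM and a non-GEM driver might have. */
struct gem_like { int refs; };
struct rdma_like { int users; };

static void gem_like_put(void *obj) { ((struct gem_like *)obj)->refs--; }
static void rdma_like_put(void *obj) { ((struct rdma_like *)obj)->users--; }
```

The cost is one function pointer per entry; a single context-wide callback would be smaller but could only handle one object type.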
[SNIP] Oh cool, I was waiting for the upload. Will watch asap. btw slides somewhere?
Attached.
[SNIP]
For the rather specific amdgpu case I could work around that by utilizing the HMM work to invalidate page tables on the fly, but that doesn't really help with memory management in general.
Yeah, so move_notify is maybe solvable with better hw and hmm,
You don't even need better hardware and HMM.
All you need is clever locking, because in this case you will never export page tables to other devices.
but there's other scenarios where I think the cross-driver ww_mutex locking will be needed, for fundamental reasons. Scenario:
- a bunch of gpus in pcie slots, all in the same machine
- because pcie is slow, a nice interconnect (iirc you guys call yours xgmi or something like that)
- working sets that are bigger than vram of a single gpu
- lots of buffer sharing ofc
Yeah, completely agree. The issue with the page tables is actually a rather specific use case.
1. So driver has an imported dma-buf, currently not mapped anywhere because it got thrown out (or first use).
2. Importer calls dma_buf_map_attachment
3. Exporter realizes there's a nice xgmi link and p2p would be much better if that object is in vram.
4. Exporter does ttm_bo_validate or equivalent to get the bo into vram, including eviction and lots of locking
5. In turn this might bite back to the importer through some move_notify of objects still mapped, but at the end of the lru.
So ->move_notify might not be the worst, eventually I think we'll need the full locking dance across drivers (or at least across drm_device instances, there might be internal upcasting going on so you get your buffers placed in the right vram and all that directly).
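The "locking dance" in that scenario can be sketched as a lock-all-with-backoff loop. This is a single-threaded userspace model under stated assumptions, not kernel code: struct obj, obj_lock(), obj_lock_slow() and lock_all() are hypothetical, and contention is faked with a test knob. In the kernel the same shape would use ww_mutex_lock() and ww_mutex_lock_slow() under an acquire context.

```c
#include <errno.h>
#include <stddef.h>

struct obj {
	int locked;         /* 1 while held by our context */
	int contended_once; /* test knob: next obj_lock() reports -EDEADLK */
};

/* Normal lock attempt; may ask the caller to back off. */
static int obj_lock(struct obj *o)
{
	if (o->contended_once) {
		o->contended_once = 0;
		return -EDEADLK;
	}
	o->locked = 1;
	return 0;
}

/* Slow path: taken while holding nothing else, so it cannot deadlock. */
static void obj_lock_slow(struct obj *o)
{
	o->contended_once = 0;
	o->locked = 1;
}

static void obj_unlock(struct obj *o)
{
	o->locked = 0;
}

/* Lock all n objects; returns the number of backoff rounds needed. */
static int lock_all(struct obj **objs, size_t n)
{
	struct obj *contended = NULL;
	int backoffs = 0;
	size_t i, j;

retry:
	if (contended)
		obj_lock_slow(contended);

	for (i = 0; i < n; i++) {
		if (objs[i] == contended)
			continue; /* already held via the slow path */
		if (obj_lock(objs[i]) == -EDEADLK) {
			/* Back off: drop everything we hold, then retry
			 * with the contended object taken first. */
			for (j = 0; j < n; j++)
				if (objs[j]->locked)
					obj_unlock(objs[j]);
			contended = objs[i];
			backoffs++;
			goto retry;
		}
	}
	return backoffs;
}
```

The hard part discussed in this thread is that eviction can add new objects, possibly owned by another driver, to the set while this loop is already running.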
YES, EXACTLY! That's the reason why I'm working on this stuff and not trying to get P2P/XGMI etc. upstream directly without it.
I mean using P2P without all this is certainly possible, but sooner or later your memory management will just fall apart.
So YES, I totally agree that we need some sort of GEM execution context or something like this to lock buffers on the fly as we try to make room for others.
So what's the plan? Merge current series (with the bikesheds addressed) under this CONFIG_EXPERIMENTAL_DYN_DMABUF and then see where we land from there? Trying to get all the pieces lined up out of tree feels like it's going to be too much :-/
At least I hoped for something like that.
Developing this out of tree, especially since I only have this as a background task, turned out to delay things over and over again.
This way I can get it upstream (even if still under an experimental flag) and start to convince our internal team/customers that we should invest time into this.
Thanks, Christian.
-Daniel

Regards,
Christian.

Cheers, Daniel

+			/* When we get an error here it means that somebody
+			 * else is holding the VM lock and updating page tables
+			 * So we can just continue here.
+			 */
+			r = dma_resv_lock(resv, ticket);
+			if (r)
+				continue;
+
+		} else {
+			/* TODO: This is more problematic and we actually need
+			 * to allow page tables updates without holding the
+			 * lock.
+			 */
+			if (!dma_resv_trylock(resv))
+				continue;
+		}
+
+		r = amdgpu_vm_clear_freed(adev, vm, NULL);
+		if (!r)
+			r = amdgpu_vm_handle_moved(adev, vm);
+
+		if (r && r != -EBUSY)
+			DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
+				  r);
+
+		dma_resv_unlock(resv);
+	}
+}
+
 static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = {
+	.move_notify = amdgpu_dma_buf_move_notify
 };
 
 /**
@@ -489,7 +553,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
 		return obj;
 
 	attach = dma_buf_dynamic_attach(dma_buf, dev->dev,
-					&amdgpu_dma_buf_attach_ops, NULL);
+					&amdgpu_dma_buf_attach_ops, obj);
 	if (IS_ERR(attach)) {
 		drm_gem_object_put(obj);
 		return ERR_CAST(attach);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 8ae260822908..8c480c898b0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -926,6 +926,9 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
 		return 0;
 	}
 
+	if (bo->tbo.base.import_attach)
+		dma_buf_pin(bo->tbo.base.import_attach);
+
 	bo->flags |= AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS;
 	/* force to pin into visible video ram */
 	if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS))
@@ -1009,6 +1012,9 @@ int amdgpu_bo_unpin(struct amdgpu_bo *bo)
 
 	amdgpu_bo_subtract_pin_size(bo);
 
+	if (bo->tbo.base.import_attach)
+		dma_buf_unpin(bo->tbo.base.import_attach);
+
 	for (i = 0; i < bo->placement.num_placement; i++) {
 		bo->placements[i].lpfn = 0;
 		bo->placements[i].flags &= ~TTM_PL_FLAG_NO_EVICT;
-- 
2.17.1
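The trylock branch of the quoted hunk can be modelled in userspace roughly as follows. This is only a sketch: vm_model and its helpers are hypothetical, and a C11 atomic_flag stands in for the reservation lock. A VM whose lock is already held is skipped, which is the behaviour the TODO in the patch flags as problematic.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct vm_model {
	atomic_flag lock; /* stand-in for the VM's reservation lock */
	bool stale;       /* page tables still point at the old location */
};

static void vm_init(struct vm_model *vm)
{
	atomic_flag_clear(&vm->lock);
	vm->stale = true;
}

/* Returns true if the lock was free and is now held by the caller. */
static bool vm_trylock(struct vm_model *vm)
{
	return !atomic_flag_test_and_set(&vm->lock);
}

static void vm_unlock(struct vm_model *vm)
{
	atomic_flag_clear(&vm->lock);
}

/* Invalidate every VM we can lock; returns how many were updated. */
static size_t move_notify_model(struct vm_model *vms, size_t n)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		/* Contended: skip and rely on the current lock holder.
		 * This is the case the patch marks with a TODO. */
		if (!vm_trylock(&vms[i]))
			continue;
		vms[i].stale = false;
		done++;
		vm_unlock(&vms[i]);
	}
	return done;
}
```

A skipped VM keeps its stale mapping until someone else updates it, which is why the thread keeps coming back to a proper cross-driver locking scheme.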
Attachment:
TTM FOSDEM 2020.pdf
Description: Adobe PDF document
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel