Re: [PATCH drm-misc-next v3 6/7] drm/gpuvm: generalize dma_resv/extobj handling and GEM validation

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Wed, 13 Sep 2023 12:39:01 +0200

Hi,

On 9/13/23 09:19, Boris Brezillon wrote:
On Wed, 13 Sep 2023 17:05:42 +1000
Dave Airlie <airlied@xxxxxxxxx> wrote:

On Wed, 13 Sept 2023 at 17:03, Boris Brezillon
<boris.brezillon@xxxxxxxxxxxxx> wrote:
On Tue, 12 Sep 2023 18:20:32 +0200
Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> wrote:

+/**
+ * get_next_vm_bo_from_list() - get the next vm_bo element
+ * @__gpuvm: The GPU VM
+ * @__list_name: The name of the list we're iterating on
+ * @__local_list: A pointer to the local list used to store already iterated items
+ * @__prev_vm_bo: The previous element we got from drm_gpuvm_get_next_cached_vm_bo()
+ *
+ * This helper is here to provide lockless list iteration. Lockless as in, the
+ * iterator releases the lock immediately after picking the first element from
+ * the list, so list insertion and deletion can happen concurrently.
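
For readers following along, the iteration pattern that the comment above describes can be sketched in plain userspace C. This is only an illustration, not the code from the patch: it assumes a pthread mutex in place of the kernel spinlock, a hand-rolled circular list in place of list_head, and it leaves out the reference counting a real vm_bo iterator would need. All names (struct locked_list, next_from_list(), restore_list()) are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly-linked list node (stand-in for list_head). */
struct node {
	struct node *prev, *next;
	int id;
};

/* A list plus the lock that protects it (stand-in for list + spinlock). */
struct locked_list {
	pthread_mutex_t lock;
	struct node head;
};

static void node_init(struct node *n) { n->prev = n->next = n; }
static int list_is_empty(const struct node *head) { return head->next == head; }

static void node_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void locked_list_init(struct locked_list *l)
{
	pthread_mutex_init(&l->lock, NULL);
	node_init(&l->head);
}

/* Insertion only needs the list lock, so it can race with an iteration. */
static void locked_list_add(struct locked_list *l, struct node *n)
{
	pthread_mutex_lock(&l->lock);
	node_add_tail(n, &l->head);
	pthread_mutex_unlock(&l->lock);
}

/*
 * Pick the first element, park it on the caller's local list, and drop the
 * lock right away. The caller then works on the element without the lock.
 */
static struct node *next_from_list(struct locked_list *l, struct node *local)
{
	struct node *n = NULL;

	pthread_mutex_lock(&l->lock);
	if (!list_is_empty(&l->head)) {
		n = l->head.next;
		node_unlink(n);
		node_add_tail(n, local);
	}
	pthread_mutex_unlock(&l->lock);
	return n;
}

/* Once iteration is done, move the visited elements back, keeping order. */
static void restore_list(struct locked_list *l, struct node *local)
{
	pthread_mutex_lock(&l->lock);
	while (!list_is_empty(local)) {
		struct node *n = local->prev; /* take from the local tail */

		node_unlink(n);
		n->next = l->head.next;       /* re-insert at the shared head */
		n->prev = &l->head;
		l->head.next->prev = n;
		l->head.next = n;
	}
	pthread_mutex_unlock(&l->lock);
}
```

The lock is only held while an element is moved between lists, so elements added to the shared list between two next_from_list() calls are simply picked up by a later call.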
Are the list spinlocks needed for that async state update from within
the dma-fence critical section we've discussed previously?
Any driver calling _[un]link() from its drm_gpu_scheduler::run_job()
hook will be in this situation (Panthor at the moment, PowerVR soon). I
get that Xe and Nouveau don't need that because they update the VM
state early (in the ioctl path), but I keep thinking this will hurt us
if we don't think it through from the beginning, because once you've
set this logic to depend only on resv locks, it will be pretty hard to
get back to a solution which lets synchronous VM_BINDs take precedence
over asynchronous requests, and, with vkQueueBindSparse() passing external
deps (plus the fact the VM_BIND queue might be pretty deep), it can
take a long time to get your synchronous VM_BIND executed...

So this would boil down to either (possibly opt-in) keeping the spinlock 
approach or pushing the unlink out to a wq then?
BTW, as also asked in a reply to Danilo, how do you call unlink from 
run_job() when it was requiring the obj->dma_resv lock, or was that a WIP?

btw what is the use case for this? do we have actual vulkan
applications we know will have problems here?
I don't, but I think that's a concern Faith raised at some point (it dates
back to when I was reading threads describing how VM_BIND on i915
should work, and I was clearly discovering this whole VM_BIND thing at
that time, so maybe I misunderstood).

it feels like a bit of premature optimisation, but maybe we have use cases.
Might be, but that's the sort of thing that would put us in a corner if
we don't have a plan for when the needs arise. Besides, if we don't
want to support that case because it's too complicated, I'd recommend
dropping all the drm_gpuvm APIs that let people think this mode is
valid/supported (map/remap/unmap hooks in drm_gpuvm_ops,
drm_gpuvm_sm_[un]map helpers, etc). Keeping them around just adds to the
confusion.

Xe allows bypassing a bind-queue with another bind-queue, but to
completely avoid dependencies between queues the operations may not
overlap (where "overlap" currently means that their page-table
structure updates overlap). No guarantees are made about priority,
though.

/Thomas