Hi Christian,

On Mon, 12 Jun 2023 15:16:03 +0200
Christian König <christian.koenig@xxxxxxx> wrote:

> Am 08.06.23 um 15:23 schrieb Donald Robson:
> > This patch adds support for 'native' dependencies to DRM scheduler. In
> > drivers that use a firmware based scheduler there are performance gains
> > to be had by allowing waits to happen in the firmware, as this reduces
> > the latency between signalling and job submission.
>
> Well, if I'm not completely mistaken this patch here is superfluous
> since we already use that functionality.
>
> This strongly sounds like the HW dependencies we have in amdgpu. See
> AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES.

I'll look at it in more detail. Thanks for the pointer.

> Basically userspace can instead of giving a hard dependency to finish
> something before the current submission starts also give a soft
> dependency and only let the other submission be scheduled.
>
> This way you can then wait for the firmware for certain operations of
> the previous submission to complete by comparing memory or registers.
>
> You don't necessarily need to give control over this to userspace, if
> your kernel driver can determine a fw assisted wait by itself that
> should also work fine.

That's what we did initially. We had a separate 'native_deps' xarray in
pvr_job that we managed ourselves, and that worked fine, except for the
kill_entity() case: if you don't wait for those 'native' fences, you
potentially signal the job's finished fence earlier than you should.

Just to make sure we're on the same page, the native fences we have here
really are dma_fences that can be waited upon FW-side: they're not
exposed to userspace, the GPU can't access the memory region containing
the counter (it's visible to the FW VM only, plus a kernel-side CPU
mapping), and we make sure they signal in finite time thanks to the job
timeout. That feels a bit different from the USER_FENCEs most GPUs have
nowadays, which don't provide this isolation guarantee, and which,
AFAIU, are currently used to do advanced userspace-driven scheduling
between queues belonging to the same context. My understanding, after
discussing it with Daniel a few weeks back, was that exposing
USER_FENCEs as dma_fences was risky, especially if they're used for
inter-context synchronization, but that the FW-visible-only ones were
okay to expose as dma_fences. Maybe I misunderstood what he suggested.

I'm done with this digression, now back to the original topic: we can of
course wait for all those native fences before calling
drm_sched_entity_destroy(), but it's a bit weird to do a partial wait in
the driver while the entity is still active (pretty sure that's racy
anyway), and then delegate the rest to the core. If we decide we don't
care about waiting for native fences when killing jobs in the
kill_entity() path, because we assume the dma_resv is covering us,
that's fine, but then I don't really see why drm_sched_kill_entity()
should wait at all, because this 'should wait, but maybe not for all
your deps' behavior is quite confusing.

Regards,

Boris
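
P.S.: To check that I understood the AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES
suggestion correctly, here is a minimal sketch of the driver-side pattern
as I read it (illustrative only, not the actual amdgpu code, and the helper
name is made up): when the dependency is a drm_sched fence, depend on its
'scheduled' fence instead of its 'finished' fence, so the scheduler only
waits for the other job to be picked up and the actual completion ordering
is left to the FW.

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/*
 * Sketch only: register 'fence' as a scheduled-only (soft) dependency
 * of 'job'. Assumes the caller passes a reference that we consume, as
 * drm_sched_job_add_dependency() does.
 */
static int job_add_scheduled_dependency(struct drm_sched_job *job,
					struct dma_fence *fence)
{
	struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);

	if (s_fence) {
		/*
		 * Only wait for the other job to be *scheduled*; the
		 * FW is expected to enforce the actual completion
		 * ordering (e.g. by comparing memory or registers).
		 */
		struct dma_fence *scheduled =
			dma_fence_get(&s_fence->scheduled);

		dma_fence_put(fence);
		fence = scheduled;
	}

	/* Non drm_sched fences still behave as hard dependencies. */
	return drm_sched_job_add_dependency(job, fence);
}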