On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > >
> >
> > No problem.
> >
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > >
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > >
> > > > This can be thought of as 4 parts, detailed below.
> > > >
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > >   entity (patches 1-3)
> > > >
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC), which is a new paradigm WRT the DRM scheduler and presents
> > > > several problems, as the DRM scheduler was originally designed to
> > > > schedule jobs on hardware queues. The main problem is that the DRM
> > > > scheduler expects the submission order of jobs to be the completion
> > > > order of jobs, even across multiple entities. This assumption falls
> > > > apart with a firmware scheduler, as a firmware scheduler has no concept
> > > > of jobs and jobs can complete out of order. A novel solution was
> > > > originally thought of by Faith during the initial prototype of Xe:
> > > > create a 1 to 1 relationship between scheduler and entity. I believe
> > > > the AGX driver [3] is using this approach and Boris may use this
> > > > approach as well for the Mali driver [4].
> > > >
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > > > relationship. The new scheduling mode should unify all drivers' usage
> > > > of a 1 to 1 relationship and can be thought of as using the scheduler
> > > > as a dependency / inflight job tracker rather than a true scheduler.
> > >
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should
> > > help remove some unneeded complexity in this use case.
> >
> > That's the idea.
> >
> > > Do you want me to pull in this series into our tree and make sure this
> > > all works out for us?
> >
> > We tested this in Xe and it definitely works for us, but the more testing
> > the better.
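
Since it keeps coming up in this thread, to make the 1 to 1 idea a bit more
concrete: each user queue owns its own scheduler + entity pair, so drm_sched
never has to arbitrate between entities and really just acts as a dependency /
inflight-job tracker. Very roughly, with illustrative names rather than the
actual Xe structures:

#include <drm/gpu_scheduler.h>

/* One of these per user exec queue - owns a dedicated scheduler + entity. */
struct example_queue {
	struct drm_gpu_scheduler sched;   /* runs only this queue's jobs */
	struct drm_sched_entity entity;   /* sole entity feeding the scheduler */
	/* firmware context handle, doorbell, etc. live here in a real driver */
};

/*
 * Init order: drm_sched_init() on &q->sched, then drm_sched_entity_init()
 * with &q->sched as the only entry in the entity's sched_list. Every job
 * submitted on q->entity can then only ever be run by q->sched, so
 * completion ordering across queues is entirely up to the firmware.
 */

Which queue actually runs next is then the firmware's problem, not drm_sched's.
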
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
>
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
>
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
>
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
>
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
>
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
>
> I tried to fix these things, but Christian objected, implying it was the
> driver's job to keep a reference from jobs and hw fences to the scheduler.
> But I find that completely broken, because besides the extra memory/resource
> usage of keeping the scheduler alive when you're trying to free resources as
> fast as possible when a process goes away, you can't even use normal
> reference counting for that: if you try to drop the last drm_sched reference
> from within a free_job() callback, the whole thing deadlocks since that will
> be running in the scheduler's thread/workqueue context, which can't free
> itself. So now you both reference count the scheduler from jobs and fences,
> and on top of that you need to outsource drm_sched freeing to a workqueue in
> the driver to make sure you don't deadlock.

This is what Xe does: jobs reference the scheduler / entity (xe_engine). When
the reference count of an xe_engine goes to zero, we trigger the teardown
process (a ping / pong with the firmware) via a CLEANUP message, and when
teardown is done, the last step of killing the scheduler is indeed done by an
async worker, as you suggest. To kill a queue, we just kick the TDR, which in
turn kills any outstanding jobs, resulting in the xe_engine ref count (at
least the refs held by the jobs) going to zero. If a user holds a ref to a
dma-fence of a job, then yes, the scheduler isn't going to be freed (it can
be killed before that, as described above). This all seems to work just fine
for Xe.
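
If it helps, the shape of that teardown in C is roughly the below. These are
illustrative names, not the actual Xe code, but the two important points are
that every in-flight job holds a reference on the engine, and that the final
kref_put only queues the teardown work instead of calling drm_sched_fini()
directly, so the last reference can safely be dropped from the free_job()
path (which runs in the scheduler's own context):

#include <drm/gpu_scheduler.h>
#include <linux/kref.h>
#include <linux/workqueue.h>

struct example_engine {
	struct kref refcount;            /* held by the user queue and each in-flight job */
	struct drm_gpu_scheduler sched;  /* 1:1 scheduler for this engine */
	struct drm_sched_entity entity;
	struct work_struct destroy_work; /* INIT_WORK() at engine creation */
};

static void example_engine_destroy_work(struct work_struct *w)
{
	struct example_engine *e = container_of(w, struct example_engine,
						destroy_work);

	/*
	 * We are not running on the scheduler's own thread / workqueue here,
	 * so tearing it down cannot deadlock on itself. A real driver would
	 * also have finished its firmware CLEANUP handshake and made sure
	 * the scheduler is idle before this point.
	 */
	drm_sched_entity_fini(&e->entity);
	drm_sched_fini(&e->sched);
	kfree(e);
}

static void example_engine_release(struct kref *ref)
{
	struct example_engine *e = container_of(ref, struct example_engine,
						refcount);

	/* May be called from free_job() - defer, never free inline. */
	queue_work(system_unbound_wq, &e->destroy_work);
}

/* .free_job backend op: the job's ref on its engine is dropped here. */
static void example_free_job(struct drm_sched_job *job)
{
	/* to_example_engine() is a hypothetical container_of() helper. */
	struct example_engine *e = to_example_engine(job->sched);

	drm_sched_job_cleanup(job);
	kref_put(&e->refcount, example_engine_release);
	/* free the driver-side job struct here */
}

Killing a queue is then just a matter of kicking the TDR so all outstanding
jobs get cleaned up, which drops the job references and lets the above run.
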
> For job fences this is particularly broken, because those fences can live
> forever signaled and attached to shared buffers and there is no guarantee
> that they will be freed in any kind of reasonable time frame. If they have
> to keep the scheduler that created them alive, that creates a lot of dead
> object junk we have to drag around just because a signaled fence exists
> somewhere.
>
> For a Rust abstraction we have to do all that tracking and refcounting in
> the abstraction itself to make it safe, which is starting to sound like
> reimplementing half of the job tracking drm_sched itself does just to fix
> the lifetime issues, which really tells me the existing design is not sound
> nor easy to use correctly in general.
>
> How does Xe deal with this (does it deal with it at all)? What happens when
> you kill -9 a process using the GPU? Does freeing all of this wait for all
> jobs to complete *and be freed* with free_job()? What about exported
> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> alive for those?

kill -9 would trigger the killing of the queue described above. And yes, if
fences are exported, the scheduler might hold onto some firmware resources
for a bit.

> Personally, after running into all this, and after seeing Christian's
> reaction to me trying to improve the state of things, I'm starting to doubt
> that drm_sched is the right solution at all for firmware-scheduling drivers.
>
> If you want a workload to try to see if you run into any of these things,
> running and killing lots of things in parallel is a good thing to try (mess
> with the numbers and let it run for a while to see if you can hit any corner
> cases):
>
> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> done; sleep 0.1; done
>

I tested this and it works in Xe. Feel free to ping me on IRC if you want to
chat more about this.

Matt

> ~~ Lina
>