On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > >
> >
> > No problem.
> >
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > >
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > >
> > > > This can be thought of as 4 parts, detailed below.
> > > >
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > >   entity (patches 1-3)
> > > >
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC), which is a new paradigm WRT the DRM scheduler and presents
> > > > several problems, as the DRM scheduler was originally designed to
> > > > schedule jobs on hardware queues. The main problem is that the DRM
> > > > scheduler expects the submission order of jobs to be the completion
> > > > order of jobs, even across multiple entities. This assumption falls
> > > > apart with a firmware scheduler, as a firmware scheduler has no concept
> > > > of jobs and jobs can complete out of order. A novel solution was
> > > > originally thought of by Faith during the initial prototype of Xe:
> > > > create a 1 to 1 relationship between scheduler and entity. I believe
> > > > the AGX driver [3] is using this approach and Boris may use this
> > > > approach as well for the Mali driver [4].
> > > >
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > > > relationship. The new scheduling mode should unify all drivers' usage
> > > > of a 1 to 1 relationship and can be thought of as using the scheduler
> > > > as a dependency / inflight job tracker rather than a true scheduler.
> > >
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should
> > > help remove some unneeded complexity in this use case.
> >
> > That's the idea.
> >
> > > Do you want me to pull in this series into our tree and make sure this
> > > all works out for us?
> >
> > We tested this in Xe and it definitely works for us, but the more testing
> > the better.
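
Since it keeps coming up in this thread, to make the 1 to 1 idea a bit more
concrete: each user queue owns its own scheduler + entity pair, so drm_sched
never has to arbitrate between entities and really just acts as a dependency /
inflight-job tracker. Very roughly, with illustrative names rather than the
actual Xe structures:

#include <drm/gpu_scheduler.h>

/* One of these per user exec queue - owns a dedicated scheduler + entity. */
struct example_queue {
	struct drm_gpu_scheduler sched;   /* runs only this queue's jobs */
	struct drm_sched_entity entity;   /* sole entity feeding the scheduler */
	/* firmware context handle, doorbell, etc. live here in a real driver */
};

/*
 * Init order: drm_sched_init() on &q->sched, then drm_sched_entity_init()
 * with &q->sched as the only entry in the entity's sched_list. Every job
 * submitted on q->entity can then only ever be run by q->sched, so
 * completion ordering across queues is entirely up to the firmware.
 */

Which queue actually runs next is then the firmware's problem, not drm_sched's.
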
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
>
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
>
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
>
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
>
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
>
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
>
> I tried to fix these things, but Christian objected, implying it was the
> driver's job to keep a reference from jobs and hw fences to the scheduler.
> But I find that completely broken, because besides the extra memory/resource
> usage of keeping the scheduler alive when you're trying to free resources as
> fast as possible when a process goes away, you can't even use normal
> reference counting for that: if you try to drop the last drm_sched reference
> from within a free_job() callback, the whole thing deadlocks since that will
> be running in the scheduler's thread/workqueue context, which can't free
> itself. So now you both reference count the scheduler from jobs and fences,
> and on top of that you need to outsource drm_sched freeing to a workqueue in
> the driver to make sure you don't deadlock.

This is what Xe does: jobs reference the scheduler / entity (xe_engine). When
the reference count of an xe_engine goes to zero, we trigger the teardown
process (a ping / pong with the firmware) via a CLEANUP message, and when
teardown is done, the last step of killing the scheduler is indeed done by an
async worker, as you suggest. To kill a queue, we just kick the TDR, which in
turn kills any outstanding jobs, resulting in the xe_engine ref count (at
least the refs held by the jobs) going to zero. If a user holds a ref to a
dma-fence of a job, then yes, the scheduler isn't going to be freed (it can
be killed before that, as described above). This all seems to work just fine
for Xe.
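
If it helps, the shape of that teardown in C is roughly the below. These are
illustrative names, not the actual Xe code, but the two important points are
that every in-flight job holds a reference on the engine, and that the final
kref_put only queues the teardown work instead of calling drm_sched_fini()
directly, so the last reference can safely be dropped from the free_job()
path (which runs in the scheduler's own context):

#include <drm/gpu_scheduler.h>
#include <linux/kref.h>
#include <linux/workqueue.h>

struct example_engine {
	struct kref refcount;            /* held by the user queue and each in-flight job */
	struct drm_gpu_scheduler sched;  /* 1:1 scheduler for this engine */
	struct drm_sched_entity entity;
	struct work_struct destroy_work; /* INIT_WORK() at engine creation */
};

static void example_engine_destroy_work(struct work_struct *w)
{
	struct example_engine *e = container_of(w, struct example_engine,
						destroy_work);

	/*
	 * We are not running on the scheduler's own thread / workqueue here,
	 * so tearing it down cannot deadlock on itself. A real driver would
	 * also have finished its firmware CLEANUP handshake and made sure
	 * the scheduler is idle before this point.
	 */
	drm_sched_entity_fini(&e->entity);
	drm_sched_fini(&e->sched);
	kfree(e);
}

static void example_engine_release(struct kref *ref)
{
	struct example_engine *e = container_of(ref, struct example_engine,
						refcount);

	/* May be called from free_job() - defer, never free inline. */
	queue_work(system_unbound_wq, &e->destroy_work);
}

/* .free_job backend op: the job's ref on its engine is dropped here. */
static void example_free_job(struct drm_sched_job *job)
{
	/* to_example_engine() is a hypothetical container_of() helper. */
	struct example_engine *e = to_example_engine(job->sched);

	drm_sched_job_cleanup(job);
	kref_put(&e->refcount, example_engine_release);
	/* free the driver-side job struct here */
}

Killing a queue is then just a matter of kicking the TDR so all outstanding
jobs get cleaned up, which drops the job references and lets the above run.
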
> For job fences this is particularly broken, because those fences can live
> forever signaled and attached to shared buffers and there is no guarantee
> that they will be freed in any kind of reasonable time frame. If they have
> to keep the scheduler that created them alive, that creates a lot of dead
> object junk we have to drag around just because a signaled fence exists
> somewhere.
>
> For a Rust abstraction we have to do all that tracking and refcounting in
> the abstraction itself to make it safe, which is starting to sound like
> reimplementing half of the job tracking drm_sched itself does just to fix
> the lifetime issues, which really tells me the existing design is not sound
> nor easy to use correctly in general.
>
> How does Xe deal with this (does it deal with it at all)? What happens when
> you kill -9 a process using the GPU? Does freeing all of this wait for all
> jobs to complete *and be freed* with free_job()? What about exported
> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> alive for those?

kill -9 would trigger the killing of the queue described above. And yes, if
fences are exported, the scheduler might hold onto some firmware resources
for a bit.

> Personally, after running into all this, and after seeing Christian's
> reaction to me trying to improve the state of things, I'm starting to doubt
> that drm_sched is the right solution at all for firmware-scheduling drivers.
>
> If you want a workload to try to see if you run into any of these things,
> running and killing lots of things in parallel is a good thing to try (mess
> with the numbers and let it run for a while to see if you can hit any corner
> cases):
>
> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> done; sleep 0.1; done
>

I tested this and it works in Xe. Feel free to ping me on IRC if you want to
chat more about this.

Matt

> ~~ Lina
>