On Thu, May 04, 2023 at 01:07:29PM +0200, Christian König wrote:
> Am 04.05.23 um 06:54 schrieb Matthew Brost:
> > On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
> > > Adding Luben as well.
> > >
> > > Am 03.05.23 um 10:16 schrieb Boris Brezillon:
> > > > [SNIP]
> > > > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > > > After the discussion I had with Matthew yesterday on IRC, I
> > > > realized there was no clear agreement on this. Matthew uses those 3
> > > > helpers in the Xe driver right now, and given he intends to use a
> > > > multi-threaded wq for its 1:1 schedulers' run queues, there's no way he
> > > > can get away without calling drm_sched_{start,stop}().
> > > > drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
> > > > wondering if it wouldn't be preferable to add a ::resubmit_job() method
> > > > or extend the ::run_job() one to support the resubmit semantics, which,
> > > > AFAIU, is just about enforcing that the job done fence (the one returned
> > > > by ::run_job()) doesn't transition from a signaled to an unsignaled state.
> > > >
> > > > But probably more important than providing a generic helper, we should
> > > > document the resubmit semantics (AKA, what should and/or shouldn't be
> > > > done with pending jobs when a recovery happens). Because forbidding
> > > > people to use a generic helper function doesn't give any guarantee that
> > > > they'll do the right thing when coding their own logic, unless we give
> > > > clues about what's considered right/wrong, and the current state of the
> > > > doc is pretty unclear in this regard.
> > > I should probably talk about the history of the re-submit feature a bit
> > > more.
> > >
> > > Basically AMD came up with re-submission as a cheap way of increasing the
> > > reliability of GPU resets. Problem is that it turned into an absolute
> > > nightmare. We tried for the last 5 years or so to get that stable and it's
> > > still crashing.
> > >
> > > The first and most major problem is that the kernel doesn't even have the
> > > information whether re-submitting jobs is possible or not. For example, a
> > > job which has already been pushed to the hw could have grabbed a binary
> > > semaphore, and re-submitting it will just wait forever for the semaphore
> > > to be released.
> > >
> > > The second problem is that the dma_fence semantics don't allow the state
> > > of a fence to ever transition from signaled back to unsignaled. This means
> > > that you can't re-use the hw fence and need to allocate a new one, but
> > > since memory allocation is forbidden inside a reset handler as well (YES,
> > > we need to better document that part) you actually need to keep a bunch
> > > of hw fences pre-allocated around to make this work. Amdgpu chose to
> > > illegally re-use the hw fence instead, which only works with quite
> > > extreme hacks.
> > >
> > > The third problem is that the lifetime of the job object was actually
> > > defined very well before we tried to use re-submission. Basically it's
> > > just an intermediate state used between the IOCTL and pushing things to
> > > the hw; introducing this re-submit feature completely messed that up and
> > > caused quite a number of use-after-free errors in the past, which were
> > > again only solved by quite some hacks.
> > >
> > > What we should do in the GPU scheduler instead is the following:
> > >
> > > 1. Don't support re-submission at all!
> > >    Instead we can provide help to drivers to query which fences (scheduler
> > >    or hw) are still not signaled yet.
> > >    This can then be used to re-create hw state if (and only if!) the driver
> > >    knows what it's doing and can actually guarantee that this will work.
> > >    E.g. the case for XE, where the kernel driver knows the contexts which
> > >    were not running at the time and can re-create their queues.
> > >
> > > 2. We can provide both a wq for single-threaded use as well as start/stop
> > >    semantics.
> > >    It's just that the start/stop semantics should never touch what was
> > >    already submitted, but rather just make sure that we don't get any new
> > >    submissions.
> > >
> > I pretty much agree with everything Christian has said here and Xe
> > aligns with this. Let me explain what Xe does.
> >
> > 1. Entity hang (TDR timeout of a job on an entity, firmware notifies Xe
> >    that an entity hung, entity IOMMU CAT error, etc...):
> >    - No re-submission at all
> >    - ban the entity
> >    - notify the UMD
> >    - clean up all pending jobs / fences
> > 2. Entire GPU hang (worth mentioning that with good HW + KMD this *should*
> >    never happen):
> >    - stop all schedulers (same as an entity in Xe because 1 to 1)
> >    - clean up odd entity state related to communication with the
> >      firmware
> >    - check if an entity has a job that started but did not finish; if
> >      so, ban it (same mechanism as above)
> >    - resubmit all jobs from good entities
> As long as you can do this without creating new dma_fence objects for the hw
> submissions everything should be fine.
>

Yep, same fence.

> >    - start all schedulers (same as an entity in Xe because 1 to 1)
> >
> > The implementation for this is in the following file [1]. Search for the
> > drm scheduler functions and you should be able to find the implementation
> > easily.
> >
> > If you want to use an ordered work queue to avoid the stop / start dance,
> > great, do that; in Xe the stop / start dance works. I have extensively
> > tested this and the flow is rock solid, and please trust me when I say
> > this, as I worked on reset flows in the i915 and fought bugs for years; I
> > still don't think it is solid in the i915 because we try to do
> > resubmission + crazy races. Because of that I was incredibly paranoid when
> > I implemented this + tested it extensively.
> >
> > I'll post an updated version of my DRM scheduler series [2] on the list
> > soon for the WQ changes, plus whatever else is required for Xe.
> >
> > So my take from this discussion is that maybe, with a little documentation,
> > we are good. Thoughts?
>
> For XE what you describe above basically sounds perfectly fine to me.
>
> I strongly assume when you re-submit things you just tell your hw scheduler
> to pick up at the place it left off? E.g. set a read pointer and write
> pointer of a ring buffer appropriately?
>

Yep.

Matt

> Christian.
> >
> > Matt
> >
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
> > [2] https://patchwork.freedesktop.org/series/116055/
> > > Regards,
> > > Christian.
> > >
> > > > Regards,
> > > >
> > > > Boris
>
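
A rough sketch of the full-GPU-hang flow described above, for illustration
only: it is not Xe code, every "my_*" name is a made-up placeholder, and only
drm_sched_stop() / drm_sched_start() (with their 2023 signatures) are real
scheduler entry points. The point it tries to capture is that the "resubmit"
step never allocates or re-arms a dma_fence; it only re-points the hw ring and
restarts the schedulers.

#include <drm/gpu_scheduler.h>
#include <linux/list.h>

/* Hypothetical driver structures: one scheduler per exec queue (1:1). */
struct my_exec_queue {
	struct drm_gpu_scheduler sched;
	struct list_head link;
};

struct my_gpu {
	struct list_head queues;
};

/* Placeholder hooks a real driver would implement. */
void my_reset_hw(struct my_gpu *gpu);
bool my_queue_has_started_unfinished_job(struct my_exec_queue *q);
void my_ban_queue(struct my_exec_queue *q);
void my_hw_resume_ring(struct my_exec_queue *q);

static void my_gpu_reset(struct my_gpu *gpu)
{
	struct my_exec_queue *q;

	/* Stop all schedulers so no new jobs reach the hw. */
	list_for_each_entry(q, &gpu->queues, link)
		drm_sched_stop(&q->sched, NULL);

	/* Reset the hw and clean up firmware communication state. */
	my_reset_hw(gpu);

	list_for_each_entry(q, &gpu->queues, link) {
		if (my_queue_has_started_unfinished_job(q)) {
			/* Ban the entity and notify the UMD, as in case 1. */
			my_ban_queue(q);
			continue;
		}

		/*
		 * "Resubmit" without touching any dma_fence: keep the
		 * existing hw fences and just tell the hw scheduler to pick
		 * up where it left off (restore ring read/write pointers).
		 */
		my_hw_resume_ring(q);

		/* Restart the scheduler so new submissions flow again. */
		drm_sched_start(&q->sched, true);
	}
}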