On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
> Adding Luben as well.
> 
> On 03.05.23 at 10:16, Boris Brezillon wrote:
> > [SNIP]
> > > To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
> > 
> > After the discussion I had with Matthew yesterday on IRC, I realized
> > there was no clear agreement on this. Matthew uses those 3 helpers in
> > the Xe driver right now, and given he intends to use a multi-threaded
> > wq for its 1:1 schedulers' run queues, there's no way he can get away
> > without calling drm_sched_{start,stop}(). drm_sched_resubmit_jobs() can
> > be open-coded in each driver, but I'm wondering if it wouldn't be
> > preferable to add a ::resubmit_job() method or extend the ::run_job()
> > one to support the resubmit semantics, which, AFAIU, is just about
> > enforcing that the job-done fence (the one returned by ::run_job())
> > doesn't transition from a signaled to an unsignaled state.
> > 
> > But probably more important than providing a generic helper, we should
> > document the resubmit semantics (AKA what should and/or shouldn't be
> > done with pending jobs when a recovery happens), because forbidding
> > people to use a generic helper function doesn't give any guarantee that
> > they'll do the right thing when coding their own logic, unless we give
> > clues about what's considered right/wrong, and the current state of the
> > doc is pretty unclear in this regard.
> 
> I should probably talk about the history of the re-submit feature a bit
> more.
> 
> Basically AMD came up with re-submission as a cheap way of increasing
> the reliability of GPU resets. Problem is that it turned into an
> absolute nightmare. We have tried for the last 5 years or so to get it
> stable and it's still crashing.
> 
> The first and most major problem is that the kernel doesn't even have
> the information whether re-submitting jobs is possible or not. For
> example, a job which has already been pushed to the hw could have
> grabbed a binary semaphore, and re-submitting it will just wait forever
> for the semaphore to be released.
> 
> The second problem is that the dma_fence semantics don't allow a fence
> to ever transition from signaled back to unsignaled. This means that
> you can't re-use the hw fence and need to allocate a new one, but since
> memory allocation is forbidden inside a reset handler as well (YES, we
> need to document that better) you actually need to keep a bunch of hw
> fences pre-allocated to make this work. Amdgpu chose to illegally
> re-use the hw fence instead, which only works with quite extreme hacks.
> 
> The third problem is that the lifetime of the job object was actually
> very well defined before we tried to use re-submission. Basically it's
> just an intermediate state used between the IOCTL and pushing things to
> the hw; introducing this re-submit feature completely messed that up
> and caused quite a number of use-after-free errors in the past, which
> are again only solved by quite some hacks.
> 
> What we should do in the GPU scheduler instead is the following:
> 
> 1. Don't support re-submission at all!
>    Instead we can provide help to drivers to query which fences
>    (scheduler or hw) are still not signaled yet.
>    This can then be used to re-create hw state if (and only if!) the
>    driver knows what it's doing and can actually guarantee that this
>    will work.
>    E.g. the case for XE, where the kernel driver knows which contexts
>    were not running at the time and can re-create their queues.
> 
> 2. We can provide both a wq to use for single-threaded applications as
>    well as start/stop semantics.
>    It's just that the start/stop semantics should never touch what was
>    already submitted, but rather just make sure that we don't get any
>    new submissions.

I pretty much agree with everything Christian has said here, and Xe aligns
with this. Let me explain what Xe does.

1. Entity hang (TDR timeout of a job on an entity, firmware notifies Xe
   that an entity hung, entity IOMMU CAT error, etc.):
   - No re-submission at all
   - Ban the entity
   - Notify the UMD
   - Clean up all pending jobs / fences

2. Entire GPU hang (worth mentioning that with good HW + KMD this *should*
   never happen):
   - Stop all schedulers (same as an entity in Xe because of the 1:1
     relationship)
   - Clean up any odd entity state related to communication with the
     firmware
   - Check whether an entity has a job that started but did not finish;
     if so, ban it (same mechanism as above)
   - Resubmit all jobs from good entities
   - Start all schedulers (same as an entity in Xe because of the 1:1
     relationship)

The implementation of this is in the following file [1]. Search for the
drm scheduler functions and you should be able to find the implementation
easily. (A rough sketch of the device-hang flow is also appended at the
bottom of this mail.)

If you want to use an ordered work queue to avoid the stop / start dance,
great, do that; in Xe the stop / start dance works. I have extensively
tested this and the flow is rock solid. Please trust me when I say this,
as I worked on reset flows in the i915 and fought bugs for years; I still
don't think it is solid in the i915, because we try to do resubmission +
crazy races. Because of that I was incredibly paranoid when I implemented
this and tested it extensively.

I'll post an updated version of my DRM scheduler series [2] on the list
soon with the WQ changes, plus whatever else is required for Xe.

So my take from this discussion is that maybe, with a little
documentation, we are good. Thoughts?

Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
[2] https://patchwork.freedesktop.org/series/116055/

> Regards,
> Christian.
> 
> > 
> > Regards,
> > Boris
> 
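To make the device-hang flow above a bit more concrete, here is a rough,
untested sketch of what it looks like when written directly against
drm_sched_stop() / drm_sched_resubmit_jobs() / drm_sched_start().
Everything prefixed with my_ is a made-up placeholder for driver-specific
code and is not what [1] actually does; only the drm_sched_* calls are
real scheduler API.

#include <linux/list.h>
#include <drm/gpu_scheduler.h>

struct my_exec_queue {
        struct drm_gpu_scheduler sched;         /* 1:1 with its entity */
        struct list_head link;
};

struct my_device {
        struct list_head queues;                /* all my_exec_queue objects */
};

/* driver-specific helpers, declarations only */
void my_sanitize_fw_state(struct my_device *dev);
bool my_queue_started_but_unfinished(struct my_exec_queue *q);
void my_ban_queue(struct my_exec_queue *q);     /* ban + error fences + notify UMD */

static void my_device_reset(struct my_device *dev)
{
        struct my_exec_queue *q;

        /* 1:1 entity/scheduler, so stop every scheduler first */
        list_for_each_entry(q, &dev->queues, link)
                drm_sched_stop(&q->sched, NULL);

        /* clean up firmware / submission backend state */
        my_sanitize_fw_state(dev);

        list_for_each_entry(q, &dev->queues, link) {
                /*
                 * A job that reached the hardware but never signaled its
                 * hw fence cannot be trusted: ban the queue, error out its
                 * pending fences and notify the UMD.
                 */
                if (my_queue_started_but_unfinished(q)) {
                        my_ban_queue(q);
                        continue;
                }

                /* innocent queue: re-push its pending jobs to the hw */
                drm_sched_resubmit_jobs(&q->sched);
        }

        /* full recovery: re-arm timeouts and let jobs run again */
        list_for_each_entry(q, &dev->queues, link)
                drm_sched_start(&q->sched, true);
}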
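And for Christian's point 1 above (a helper to query which fences are
still not signaled), something like the sketch below could work. This is
completely untested and just meant to illustrate the idea; the callback
typedef is made up, but pending_list / job_list_lock / s_fence are the
existing drm_sched_job bookkeeping.

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

typedef void (*unsignaled_job_cb)(struct drm_sched_job *job, void *data);

/* call @cb for every pending job whose hw fence has not signaled yet */
static void drm_sched_for_each_unsignaled_job(struct drm_gpu_scheduler *sched,
                                              unsignaled_job_cb cb, void *data)
{
        struct drm_sched_job *job;

        spin_lock(&sched->job_list_lock);
        list_for_each_entry(job, &sched->pending_list, list) {
                struct dma_fence *hw_fence = job->s_fence->parent;

                /* no hw fence yet, or hw fence still pending */
                if (!hw_fence || !dma_fence_is_signaled(hw_fence))
                        cb(job, data);
        }
        spin_unlock(&sched->job_list_lock);
}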