Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

Rob Clark <robdclark@xxxxxxxxx> · Fri, 18 Mar 2022 08:12:54 -0700

On Fri, Mar 18, 2022 at 12:42 AM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> Am 17.03.22 um 18:31 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >> [SNIP]
> >>> (At some point, I'd like to use scheduler for the replay, and actually
> >>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>> some sched bugs in that area which prevented me from deleting a bunch
> >>> of code ;-))
> >> Not sure about your hw, but at least on intel replaying tends to just
> >> result in follow-on fun. And that holds even more so the more complex a
> >> workload is. This is why vk just dies immediately and does not try to
> >> replay anything, offloading it to the app. Same with ARB robustness.
> >> Afaik it's really only media and classic gl which insist that the driver
> >> stack somehow recover.
> > At least for us, each submit must be self-contained (ie. not rely on
> > previous GPU hw state), so in practice replay works out pretty well.
> > The worst case is that subsequent submits from the same process fail
> > as well (if they depended on something that the crashing submit failed
> > to write back to memory), but in that case they just crash as well and
> > we move on to the next one. The recent gens (a5xx+ at least) are pretty
> > good about quickly detecting problems and giving us an error irq.
>
> Well I absolutely agree with Daniel.
>
> The whole replay thing AMD did in the scheduler is an absolute mess
> and should probably be killed with fire.
>
> I strongly recommend not making the same mistake in other drivers.
>
> If you want to have some replay feature then please make it driver
> specific and don't use anything from the infrastructure in the DRM
> scheduler.

Hmm, perhaps I was not clear, but I'm only talking about re-emitting
jobs *following* the faulting one (which could be from other contexts,
etc.), not trying to restart the faulting job.

You *absolutely* need to replay jobs following the faulting one, they
could be from unrelated contexts/processes.  You can't just drop them
on the floor.
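
As a rough illustration of that distinction, here is a minimal sketch in
plain C. It is a toy model only: struct job, recover_ring() and the
fields below are hypothetical names made up for this example, not actual
msm or drm_sched code. It assumes jobs sit in submission order and that
everything before the faulting job has already retired.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a submitted job; not a real msm or drm_sched type. */
struct job {
	int ctx_id;     /* owning context/process */
	bool replayed;  /* re-emitted during recovery */
	bool dropped;   /* the faulting job itself, never restarted */
};

/*
 * Recover after a fault.  jobs[] is in submission order and jobs[faulting]
 * is the one that hung.  Jobs before it are assumed to have retired already
 * and are left alone.  Every job queued after it is re-emitted, regardless
 * of which context it came from.  Returns the number of jobs replayed.
 */
static size_t recover_ring(struct job *jobs, size_t n, size_t faulting)
{
	size_t replayed = 0;

	assert(faulting < n);

	/* Do not try to restart the faulting job. */
	jobs[faulting].dropped = true;

	for (size_t i = faulting + 1; i < n; i++) {
		/* Stand-in for writing the job back into the ring. */
		jobs[i].replayed = true;
		replayed++;
	}

	return replayed;
}
```

With four queued jobs and a fault on the second, the last two are
replayed even if they belong to other contexts, and only the faulting
one is dropped.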

Currently it is all driver specific, but I wanted to delete a lot of
code and move to using the scheduler to handle faults/timeouts (but
I'm blocked on that until [1] is resolved).

[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@xxxxxxx/

BR,
-R

> Thanks,
> Christian.
>
> >
> > BR,
> > -R
> >
> >> And recovering from a mess in userspace is a lot simpler than trying to
> >> pull off the same magic in the kernel. Plus it also helps with a few of the
> >> dma_fence rules, which is a nice bonus.
> >> -Daniel
> >>
>