On Fri, Mar 18, 2022 at 12:42 AM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> On 17.03.22 at 18:31, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >> [SNIP]
> >>> (At some point, I'd like to use scheduler for the replay, and actually
> >>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>> some sched bugs in that area which prevented me from deleting a bunch
> >>> of code ;-))
> >> Not sure about your hw, but at least on intel replaying tends to just
> >> result in follow-on fun. And that holds even more so the more complex a
> >> workload is. This is why vk just dies immediately and does not try to
> >> replay anything, offloading it to the app. Same with arb robustness.
> >> Afaik it's really only media and classic gl which insist that the driver
> >> stack somehow recover.
> > At least for us, each submit must be self-contained (ie. not rely on
> > previous GPU hw state), so in practice replay works out pretty well.
> > The worst case is subsequent submits from same process fail as well
> > (if they depended on something that crashing submit failed to write
> > back to memory).. but in that case they just crash as well and we move
> > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > about quickly detecting problems and giving us an error irq.
>
> Well I absolutely agree with Daniel.
>
> The whole replay thing AMD did in the scheduler is an absolute mess
> and should probably be killed with fire.
>
> I strongly recommend not to make the same mistake in other drivers.
>
> If you want to have some replay feature then please make it driver
> specific and don't use anything from the infrastructure in the DRM
> scheduler.

hmm, perhaps I was not clear, but I'm only talking about re-emitting
jobs *following* the faulting one (which could be from other contexts,
etc).. not trying to restart the faulting job.

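To make the distinction concrete, here is a rough userspace sketch of
"skip the faulting job, re-emit what was queued behind it". It is only
an illustration: struct submit, recover_ring() and the fields are
hypothetical names, not the actual msm or drm_sched code, and a real
driver would rewrite ring commands where the sketch just sets a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-flight submit; not a real kernel type. */
struct submit {
	int ctx_id;     /* owning context/process */
	int seqno;      /* order in which it was queued on the ring */
	bool faulted;   /* blamed for the hang, will not be restarted */
	bool reemitted; /* written to the ring again after the reset */
};

/*
 * After a GPU reset: mark the faulting submit as failed and re-emit
 * every submit queued behind it, regardless of which context owns it.
 * Submits ahead of the faulting one are left untouched.
 * Returns the number of submits that were re-emitted.
 */
size_t recover_ring(struct submit *ring, size_t count, size_t faulting_idx)
{
	size_t replayed = 0;

	assert(ring != NULL || count == 0);

	if (faulting_idx >= count)
		return 0;

	/* The faulting job itself is never replayed. */
	ring[faulting_idx].faulted = true;

	for (size_t i = faulting_idx + 1; i < count; i++) {
		/* A real driver would re-write this submit's commands here. */
		ring[i].reemitted = true;
		replayed++;
	}

	return replayed;
}
```

With four in-flight submits and the second one faulting,
recover_ring(ring, 4, 1) re-emits the last two and leaves the first
alone, even though they belong to different contexts.
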
You *absolutely* need to replay jobs following the faulting one, since
they could be from unrelated contexts/processes. You can't just drop
them on the floor.

Currently it is all driver specific, but I wanted to delete a lot of
code and move to using the scheduler to handle faults/timeouts (but I'm
blocked on that until [1] is resolved).

[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@xxxxxxx/

BR,
-R

> Thanks,
> Christian.
>
> >
> > BR,
> > -R
> >
> >> And recovering from a mess in userspace is a lot simpler than trying to
> >> pull off the same magic in the kernel. Plus it also helps with a few of the
> >> dma_fence rules, which is a nice bonus.
> >> -Daniel
> >>
>