On Fri, Mar 18, 2022 at 08:12:54AM -0700, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 12:42 AM Christian König
> <christian.koenig@xxxxxxx> wrote:
> >
> > Am 17.03.22 um 18:31 schrieb Rob Clark:
> > > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > >> [SNIP]
> > >>> (At some point, I'd like to use the scheduler for the replay, and
> > >>> actually use drm_sched_stop()/etc.. but last time I looked there
> > >>> were still some sched bugs in that area which prevented me from
> > >>> deleting a bunch of code ;-))
> > >> Not sure about your hw, but at least on intel replaying tends to just
> > >> result in follow-on fun. And that holds even more so the more complex
> > >> a workload is. This is why vk just dies immediately and does not try
> > >> to replay anything, offloading it to the app. Same with arb
> > >> robustness. Afaik it's really only media and classic gl which insist
> > >> that the driver stack somehow recover.
> > > At least for us, each submit must be self-contained (ie. not rely on
> > > previous GPU hw state), so in practice replay works out pretty well.
> > > The worst case is that subsequent submits from the same process fail
> > > as well (if they depended on something the crashing submit failed to
> > > write back to memory), but then they just crash too and we move on to
> > > the next one. The recent gens (a5xx+ at least) are pretty good about
> > > quickly detecting problems and giving us an error irq.
> >
> > Well I absolutely agree with Daniel.
> >
> > The whole replay thing AMD did in the scheduler is an absolute mess
> > and should probably be killed with fire.
> >
> > I strongly recommend not to make the same mistake in other drivers.
> >
> > If you want to have some replay feature then please make it driver
> > specific and don't use anything from the infrastructure in the DRM
> > scheduler.
>
> hmm, perhaps I was not clear, but I'm only talking about re-emitting
> jobs *following* the faulting one (which could be from other contexts,
> etc).. not trying to restart the faulting job.

You absolutely can drop jobs on the floor; this is what both anv and
iris expect. They use what we call a non-recoverable context, meaning
that when any gpu hang happens and the context is affected (whether as
the guilty one, or because it was a multi-engine reset and it got
victimized) we kill it entirely. No replaying, and any further execbuf
ioctl fails with -EIO. Userspace then gets to sort out the mess, which
for vk is VK_ERROR_DEVICE_LOST, for robust gl it's the same, and for
non-robust gl iris re-creates a pile of things. Anything in-between
_is_ dropped on the floor completely.

Also note that this is obviously uapi: if you have a userspace which
expects contexts to survive, then replaying makes some sense.

> You *absolutely* need to replay jobs following the faulting one, they
> could be from unrelated contexts/processes. You can't just drop them
> on the floor.
>
> Currently it is all driver specific, but I wanted to delete a lot of
> code and move to using the scheduler to handle faults/timeouts (but
> blocked on that until [1] is resolved)

Yeah, for the drivers where the uapi is "you can safely replay after a
hang, and you're supposed to", sharing the code is of course a good
idea. I just wanted to make it clear that this is only one of many uapi
flavours you can pick from; dropping it all on the floor is a perfectly
legit approach :-) And imo it's the more robust one, and it also fits
better with the latest apis like GL_ARB_robustness or vk.
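To make the non-recoverable flavour concrete, here's a rough sketch of
the driver side (hypothetical names, not any driver's actual code):
once a context was involved in a reset, it gets marked as banned, and
every later submission simply bounces with -EIO:

  /* hypothetical sketch, all names made up */
  static void my_handle_gpu_reset(struct my_context *ctx)
  {
          /* guilty or victimized, either way the context is dead */
          ctx->banned = true;
  }

  static int my_submit_ioctl(struct my_context *ctx)
  {
          if (ctx->banned)
                  return -EIO;    /* userspace sorts out the mess */

          /* ... queue the job as usual ... */
          return 0;
  }

On the userspace side a robust gl app is then expected to check for
resets itself, roughly like below. glGetGraphicsResetStatusARB is the
real GL_ARB_robustness entry point; the rebuild helper is made up and
app-specific:

  GLenum status = glGetGraphicsResetStatusARB();
  if (status != GL_NO_ERROR) {
          /* context is gone for good, rebuild everything from scratch */
          app_recreate_context_and_objects();
  }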
Cheers, Daniel

> [1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@xxxxxxx/
>
> BR,
> -R
>
> > Thanks,
> > Christian.
> > >
> > > BR,
> > > -R
> > >
> > >> And recovering from a mess in userspace is a lot simpler than
> > >> trying to pull off the same magic in the kernel. Plus it also helps
> > >> with a few of the dma_fence rules, which is a nice bonus.
> > >> -Daniel
> > >>
> >
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch