On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@xxxxxxxxx> wrote:
> > > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > > <christian.koenig@xxxxxxx> wrote:
> > > > Well you can, it just means that their contexts are lost as well.
> > >
> > > Which is rather inconvenient when deqp-egl reset tests, for example,
> > > take down your compositor ;-)
> >
> > Yeah. Or anything WebGL.
> >
> > System-wide collateral damage is definitely a non-starter. If that
> > means that the userspace driver has to do what iris does and ensure
> > everything's recreated and resubmitted, that works too, just as long
> > as the response to 'my adblocker didn't detect a crypto miner ad' is
> > something better than 'shoot the entire user session'.
>
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.
>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.

It would perhaps have been nice if older gens, which don't (yet) have
per-process pgtables, had gone with userspace replay (although that
would require a lot more tracking in userspace than what is done
currently),
but fortunately those older gens don't use "state objects" which could
potentially be corrupted, but instead re-emit state in cmdstream, so
there is a lot less possibility for bad collateral damage. (On all
the gens we also use gpu read-only buffers whenever the gpu does not
need to be able to write them.)

For newer stuff, the process isolation works pretty well. In fact we
recently changed MSM_PARAM_FAULTS to only report faults/hangs in the
same address space, so the compositor is not even aware (and doesn't
need to be aware).

BR,
-R

> -Daniel
>
> > Cheers,
> > Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch