On Tue, Mar 29, 2022 at 12:25:55PM -0400, Marek Olšák wrote:
> I don't know what iris does, but I would guess that the same problems as
> with AMD GPUs apply, making GPU resets very fragile.

iris_batch_check_for_reset -> replace_kernel_ctx -> iris_lost_context_state
is, I think, the main call chain for how this is handled/detected. There's
also a side-chain which handles -EIO from execbuf.

Also, this is using non-recoverable contexts, i.e. any time they suffer
from a GPU reset (either because they are guilty themselves, or as
collateral damage of a reset that shot more than just the guilty context)
the context stops entirely and refuses any further execbuf with -EIO.

Cheers, Daniel

>
> Marek
>
> On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@xxxxxxx>
> wrote:
>
> > My main question is what does the iris driver do better than radeonsi
> > when the client doesn't support the robustness extension?
> >
> > From Daniel's description it sounds like they have at least a partial
> > recovery mechanism in place.
> >
> > Apart from that I completely agree to what you said below.
> >
> > Christian.
> >
> > On 26.03.22 at 01:53, Olsak, Marek wrote:
> >
> > [AMD Official Use Only]
> >
> > amdgpu has 2 resets: soft reset and hard reset.
> >
> > The soft reset is able to recover from an infinite loop and even some
> > GPU hangs due to bad shaders or bad states. The soft reset uses a
> > signal that kills all currently-running shaders of a certain process
> > (VM context), which unblocks the graphics pipeline, so draws and
> > command buffers finish, but not correctly. This can then cause a hard
> > hang if the shader was supposed to signal work completion through a
> > shader store instruction and a non-shader consumer is waiting for it
> > (skipping the store instruction by killing the shader won't signal the
> > work, and thus the consumer will be stuck, requiring a hard reset).
> >
> > The hard reset can recover from other hangs, which is great, but it may
> > use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> > contents, but we should assume that any process that had running jobs
> > on the GPU during a GPU reset has its memory resources in an
> > inconsistent state, and thus following command buffers can cause
> > another GPU hang. The shader store example above is enough to cause
> > another hard hang due to incorrect content in memory resources, which
> > can contain synchronization primitives that are used internally by the
> > hardware.
> >
> > Asking the driver to replay a command buffer that caused a hang is a
> > sure way to hang it again. Unrelated processes can be affected due to
> > lost VRAM or the misfortune of using the GPU while the GPU hang
> > occurred. The window system should recreate GPU resources and redraw
> > everything without affecting applications. If apps use GL, they should
> > do the same. Processes that can't recover by redrawing content can be
> > terminated or left alone, but they shouldn't be allowed to submit work
> > to the GPU anymore.
> >
> > dEQP only exercises the soft reset. I think WebGL is only able to
> > trigger a soft reset at this point, but Vulkan can also trigger a hard
> > reset.
> >
> > Marek
> > ------------------------------
> > *From:* Koenig, Christian <Christian.Koenig@xxxxxxx>
> > *Sent:* March 23, 2022 11:25
> > *To:* Daniel Vetter <daniel@xxxxxxxx>; Daniel Stone
> > <daniel@xxxxxxxxxxxxx>; Olsak, Marek <Marek.Olsak@xxxxxxx>;
> > Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> > *Cc:* Rob Clark <robdclark@xxxxxxxxx>; Rob Clark
> > <robdclark@xxxxxxxxxxxx>; Sharma, Shashank <Shashank.Sharma@xxxxxxx>;
> > Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Somalapuram,
> > Amaranath <Amaranath.Somalapuram@xxxxxxx>; Abhinav Kumar
> > <quic_abhinavk@xxxxxxxxxxx>; dri-devel
> > <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx list
> > <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Deucher, Alexander
> > <Alexander.Deucher@xxxxxxx>; Shashank Sharma
> > <contactshashanksharma@xxxxxxxxx>
> > *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> >
> > [Adding Marek and Andrey as well]
> >
> > On 23.03.22 at 16:14, Daniel Vetter wrote:
> > > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@xxxxxxxxxxxxx>
> > > wrote:
> > >> Hi,
> > >>
> > >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@xxxxxxxxx>
> > >> wrote:
> > >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > >>> <christian.koenig@xxxxxxx> wrote:
> > >>>> Well you can, it just means that their contexts are lost as well.
> > >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> > >>> take down your compositor ;-)
> > >> Yeah. Or anything WebGL.
> > >>
> > >> System-wide collateral damage is definitely a non-starter. If that
> > >> means that the userspace driver has to do what iris does and ensure
> > >> everything's recreated and resubmitted, that works too, just as long
> > >> as the response to 'my adblocker didn't detect a crypto miner ad' is
> > >> something better than 'shoot the entire user session'.
> > > Not sure where that idea came from, I thought at least I made it clear
> > > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > > which should die without recovery attempt.
> > >
> > > The entire discussion here is who should be responsible for replay and
> > > at least if you can decide the uapi, then punting that entirely to
> > > userspace is a good approach.
> >
> > Yes, completely agree. We have the approach of re-submitting things in
> > the kernel and that failed quite miserably.
> >
> > In other words currently a GPU reset has something like a 99% chance to
> > take down your whole desktop.
> >
> > Daniel can you briefly explain what exactly iris does when a lost
> > context is detected without gl robustness?
> >
> > It sounds like you guys got that working quite well.
> >
> > Thanks,
> > Christian.
> >
> > >
> > > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > > not currently on the gpu, or on different engines and all that
> > > shouldn't be nuked, if possible.
> > >
> > > Also ofc since msm uapi is that the kernel tries to recover there's
> > > not much we can do there, contexts cannot be shot. But still trying to
> > > replay them as much as possible feels a bit like overkill.
> > > -Daniel
> > >
> > >> Cheers,
> > >> Daniel
> > >
> >
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
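
For readers following the thread, the non-recoverable-context behaviour
described at the top of this message can be illustrated with a small toy
model. This is only a sketch and not iris or i915 code: the method names
loosely mirror the call chain named above (check for reset, replace kernel
context, lost context state), while the ToyKernelContext and ToyBatch
classes, their fields, and the "<full state re-emit>" marker are
hypothetical. It also assumes that the failed submission is simply dropped
rather than replayed, which is a simplification chosen for the sketch.

```python
import errno


class ToyKernelContext:
    """Hypothetical stand-in for a non-recoverable kernel GPU context."""

    def __init__(self):
        self.banned = False

    def hit_by_reset(self):
        # Guilty or collateral damage: either way the context is dead.
        self.banned = True

    def execbuf(self, commands):
        # A banned context refuses every further submission with -EIO.
        if self.banned:
            return -errno.EIO
        return 0


class ToyBatch:
    """Hypothetical userspace batch object that owns one kernel context."""

    def __init__(self):
        self.ctx = ToyKernelContext()
        self.state_valid = False  # full GPU state emitted on this context?
        self.contexts_replaced = 0

    def lost_context_state(self):
        # Nothing programmed on the old context can be trusted any more,
        # so the next submission has to re-emit all state.
        self.state_valid = False

    def replace_kernel_ctx(self):
        # Swap in a fresh kernel context and forget the old state.
        self.ctx = ToyKernelContext()
        self.contexts_replaced += 1
        self.lost_context_state()

    def check_for_reset(self):
        # Returns True if a reset was detected and handled.
        if self.ctx.banned:
            self.replace_kernel_ctx()
            return True
        return False

    def submit(self, commands):
        if not self.state_valid:
            commands = ["<full state re-emit>"] + list(commands)
        ret = self.ctx.execbuf(commands)
        if ret == -errno.EIO:
            # Side path: the submission itself reported the dead context.
            # The failed work is dropped here, not replayed.
            self.check_for_reset()
            return ret
        self.state_valid = True
        return ret
```

In this model a submission after a reset fails once with -EIO, the batch
then switches to a fresh context, and the following submission succeeds
after re-emitting state. Only the client's own context is replaced, so
other clients are not involved in the recovery.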