Quoting Mika Kuoppala (2018-03-28 08:58:38)
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
>
> > Tvrtko uncovered a fun issue with recovering from a wedged device. In his
> > tests, he wedged the driver by injecting an unrecoverable hang whilst a
> > batch was spinning. As we reset the gpu in the middle of the spinner,
> > when resumed it would continue on from the next instruction in the ring
> > and write its breadcrumb. However, on wedging we updated our
> > bookkeeping to indicate that the GPU had completed executing and would
> > restart from after the breadcrumb; so the emission of the stale
> > breadcrumb from before the reset came as a bit of a surprise.
> >
>
> Ok trying to make sense of the above and how the wedging works.
> Here are my assertions.
>
> The spinning batch was never found to be guilty of anything.

It was definitely guilty.

> On wedge we fast forwarded all engine seqnos to be what
> was last submitted.

Correct.

> We did hw reset.

Correct.

> On context image, the RING_HEAD was pointing to bb start
> of spin batch (or the instruction after it)

Instruction after.

> On resubmitting the context, we saw a seqno write from pre
> reset era.

Correct.

> So this doesn't affect only spinning batches but any busy
> batch that was running while we wedged?

Correct. Any execlists recovery from _wedged_ would be prone to hitting
this bug. Legacy submission already applies the ring registers reset on
recovery.

Thinking of which, if we could, we should ban all contexts on wedging?
Or at least process the ban accounting for a failed reset. That sounds
more plausible (set_wedge() is a nasty lockless affair).
-Chris
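
For anyone following along, the sequence confirmed above can be sketched as a toy model in plain C. This is an illustration only, not i915 code: every name in it (toy_engine, toy_submit, toy_wedge, toy_reset_ring and so on) is hypothetical, the ring is a flat array with no wrap, and the "ring reset" is just head = tail, loosely analogous to the ring register reset mentioned for legacy submission rather than anything taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: these names are hypothetical, not i915 symbols. */
enum { RING_SLOTS = 16, BATCH_START = 0 };

struct toy_engine {
	uint32_t ring[RING_SLOTS]; /* BATCH_START, or a seqno to write as a breadcrumb */
	unsigned int head;         /* next instruction to execute (as saved in the context image) */
	unsigned int tail;         /* one past the last submitted instruction */
	uint32_t completed;        /* driver bookkeeping: last seqno believed complete */
	uint32_t last_submitted;   /* last seqno queued to the ring */
};

/* Queue one request: a batch followed by its breadcrumb write. */
static void toy_submit(struct toy_engine *e, uint32_t seqno)
{
	assert(seqno != BATCH_START);      /* seqnos start at 1 in this sketch */
	assert(e->tail + 2 <= RING_SLOTS); /* no ring wrap in this sketch */
	e->ring[e->tail++] = BATCH_START;
	e->ring[e->tail++] = seqno;
	e->last_submitted = seqno;
}

/* The GPU enters the batch and spins: head now points at the instruction after it. */
static void toy_hang_in_batch(struct toy_engine *e)
{
	assert(e->head < e->tail && e->ring[e->head] == BATCH_START);
	e->head++;
}

/* Wedging: fast forward the bookkeeping to whatever was last submitted. */
static void toy_wedge(struct toy_engine *e)
{
	e->completed = e->last_submitted;
}

/* Assumed remedy for the sketch: drop everything still pending in the ring. */
static void toy_reset_ring(struct toy_engine *e)
{
	e->head = e->tail;
}

/* Resume from the saved head; return how many stale breadcrumbs get written. */
static unsigned int toy_resume(struct toy_engine *e)
{
	unsigned int stale = 0;

	while (e->head < e->tail) {
		uint32_t insn = e->ring[e->head++];

		if (insn == BATCH_START)
			continue;
		if (insn <= e->completed)
			stale++;             /* seqno from the pre-reset era */
		else
			e->completed = insn; /* genuinely new work */
	}
	return stale;
}

/* Run the whole scenario, with or without the ring reset on recovery. */
unsigned int toy_scenario(int reset_ring)
{
	struct toy_engine e = {0};

	toy_submit(&e, 1);     /* the spinner */
	toy_submit(&e, 2);     /* a request queued behind it */
	toy_hang_in_batch(&e);
	toy_wedge(&e);
	if (reset_ring)
		toy_reset_ring(&e);
	return toy_resume(&e);
}
```

In this model toy_scenario(0) returns 2, since both pre-reset breadcrumbs are replayed after the bookkeeping already marked them complete, and toy_scenario(1) returns 0. Nothing in the model depends on the batch being a spinner, which matches the point that any busy batch running at wedge time would be affected.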