Re: [RFC] How to assign blame when multiple rings are hung

On Tue, Jan 28, 2014 at 12:39:40PM +0000, Chris Wilson wrote:
> On Tue, Jan 28, 2014 at 01:16:34PM +0200, Mika Kuoppala wrote:
> > Hi,
> > 
> > I am working on a patchset [1] which originally aimed to fix
> > how we find the guilty batches with ppgtt.
> > 
> > But during review it became clear that I don't have a firm idea
> > of what the behaviour should be when multiple rings encounter
> > a problematic batch at the same time.
> > 
> > The following i-g-t patch adds a test which asserts that
> > both contexts get blamed for having a (problematic) batch active
> > during the hang.
> > 
> > The patch set [1] will fail with this test case, as it will
> > blame only the first context that injected the hang.
> > We would need to change the test as follows for it to pass:
> > -       assert_reset_status(fd[1], 0, RS_BATCH_ACTIVE);
> > +       assert_reset_status(fd[1], 0, RS_BATCH_PENDING);
> > 
> > I lean towards both contexts getting their batch_active count
> > increased, as other rings might gain contexts and we could
> > already reset individual rings instead of the whole GPU.
> > 
> > But we need to make a choice, so that's why the RFC.
> > Thoughts?
> 
> Assuming idealised code, both get blamed today. Which gets blamed first
> is effectively random (whichever accumulates hangscore quickest); that
> triggers either a full GPU reset and replay of unaffected batches, or a
> ring reset (in which case we should not touch the other context on the
> other rings). Then once the GPU is running again, it will hang on the
> other ring, we will detect it, and the blame game starts all over again.
> We do have a fairness issue whereby a sequence of bad batches on one
> ring may prevent us from detecting a hang on the other - but if we have
> replay working, then we carry over the hangscore as well, so the blame
> should be fairly apportioned.

I agree: in the end, when two batches hang the GPU on different rings, both
should get blamed. Unlike the case of two independent batches submitted to
the same ring where both hang (but where the 2nd one could be victimized by
the first), I can't come up with a reasonable scenario where we can't do
this.
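
For reference, the assert_reset_status() checks in the i-g-t snippet above
boil down to reading the per-context reset stats ioctl. A rough sketch of
that query (helper names here are illustrative and error handling is
trimmed - this is not the actual test code):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <i915_drm.h> /* via libdrm's include path, e.g. -I/usr/include/libdrm */

/* Ask the kernel how often this context has been blamed for hangs;
 * ctx_id 0 means the default context of this fd. */
static int query_reset_stats(int fd, uint32_t ctx_id,
			     struct drm_i915_reset_stats *stats)
{
	memset(stats, 0, sizeof(*stats));
	stats->ctx_id = ctx_id;
	return ioctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, stats);
}

static int blamed_as_active(int fd)
{
	struct drm_i915_reset_stats stats;

	if (query_reset_stats(fd, 0, &stats))
		return -1;

	/* batch_active: a batch from this context was executing when the
	 * hang was handled (roughly what RS_BATCH_ACTIVE asserts);
	 * batch_pending only counts batches queued behind someone
	 * else's hang. */
	return stats.batch_active > 0;
}

So the question in this thread is really just which of those two counters
gets bumped for the second context when both rings hang at the same time.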
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch



