On Tue, Jan 28, 2014 at 01:16:34PM +0200, Mika Kuoppala wrote:
> Hi,
>
> I am working on a patchset [1] which originally aimed to fix
> how we find the guilty batches with ppgtt.
>
> But during review it became clear that I don't have a clear
> idea of what the behaviour should be when multiple rings
> encounter a problematic batch at the same time.
>
> The following i-g-t patch adds a test which asserts that
> both contexts get the blame for having a (problematic) batch
> active during the hang.
>
> The patch set [1] fails this test case, as it blames only
> the first context that injected the hang.
> We would need to change the test for it to pass:
> -	assert_reset_status(fd[1], 0, RS_BATCH_ACTIVE);
> +	assert_reset_status(fd[1], 0, RS_BATCH_PENDING);
>
> I lean towards both contexts getting their batch_active count
> increased, as other rings might gain contexts and we could
> already reset individual rings instead of the whole GPU.
>
> But we need to make a choice, so that's why the RFC.
> Thoughts?

Assuming idealised code, both get blamed today. Which gets blamed first
is decided at random (whichever accumulates hangscore quickest); that
triggers either a full GPU reset and a replay of the unaffected batches,
or a ring reset (in which case we should not touch the other context on
the other rings). Then, once the GPU is running again, it will hang on
the other ring, we will detect it, and the blame game starts all over
again.

We do have a fairness issue whereby a sequence of bad batches on one
ring may prevent us from detecting a hang on the other - but if we have
replay working, then we carry over the hangscore as well, and so the
blame should be fairly apportioned.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
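
A minimal sketch of the double-hang scenario under discussion (not Mika's
actual i-g-t patch): it reuses the assert_reset_status()/RS_BATCH_* names
quoted above, assumes the i-g-t drm_open_any() and gem_quiescent_gpu()
helpers, and relies on a hypothetical inject_hang(fd, ctx, ring) that
submits a batch guaranteed to hang on the given ring. If both contexts are
blamed, both RS_BATCH_ACTIVE asserts hold; if only the first is blamed, the
second client would report RS_BATCH_PENDING instead.

/*
 * Sketch only: two clients each hang a different ring at roughly the
 * same time, then we check which of them gets batch_active bumped.
 */
static void test_dual_ring_hang(void)
{
	int fd[2];

	fd[0] = drm_open_any();
	fd[1] = drm_open_any();

	inject_hang(fd[0], 0, I915_EXEC_RENDER);	/* hypothetical helper */
	inject_hang(fd[1], 0, I915_EXEC_BSD);		/* hypothetical helper */

	/* Wait for hangcheck to fire and the reset(s) to complete. */
	gem_quiescent_gpu(fd[0]);
	gem_quiescent_gpu(fd[1]);

	/* If both contexts are blamed for their active batch: */
	assert_reset_status(fd[0], 0, RS_BATCH_ACTIVE);
	assert_reset_status(fd[1], 0, RS_BATCH_ACTIVE);
	/*
	 * If only the first context is blamed, the second would instead be:
	 *	assert_reset_status(fd[1], 0, RS_BATCH_PENDING);
	 */

	close(fd[0]);
	close(fd[1]);
}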