Hi Russell,

On Tuesday, 19.06.2018 at 10:43 +0100, Russell King - ARM Linux wrote:
> It looks like a bug has crept in to etnaviv between 4.16 and 4.17,
> which causes etnaviv to misbehave with the GC600 GPU on Dove. I
> don't think it's a GPU issue, I think it's a DRM issue.
>
> I get multiple:
>
> [ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
> [ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
>
> while Xorg is starting up. Ignore the "failed to reset", that
> just seems to be a property of the GC600, and of course is a
> subsequent issue after the primary problem.
>
> Looking at the devcoredump:
>
> 00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
>
> So, all units on the GC600 were idle except for the front end.
>
> 00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
> 00000664 = 102d06d8 Command DMA address
> 00000668 = 380000c8 FE fetched word 0
> 0000066c = 0000001f FE fetched word 1
>
> The front end was basically idle at this point, at a WAIT 200 command.
> Digging through the ring:
>
> 00688: 08010e01 00000040 LDST 0x3804=0x00000040
> 00690: 40000002 102d06a0 LINK 0x102d06a0
> 00698: 40000002 102d0690 LINK 0x102d0690
> 006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
> 006a8: 40000025 102d3000 LINK 0x102d3000
> 006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
> 006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
> 006c0: 48000000 00000701 STALL FE -> PE
> 006c8: 08010e01 00000041 LDST 0x3804=0x00000041
> 006d0: 380000c8(0000001f) WAIT 200
> > 006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
>
> We've basically come to the end of the currently issued command stream
> and hit the wait-link loop. Everything else in the devcoredump looks
> normal.
>
> So, I think etnaviv DRM has missed an event signalled from the GPU.

I don't see what would make us miss an event suddenly.

> This worked fine in 4.16, so seems to be a regression.

The only thing that comes to mind is the job timeout. With the DRM
scheduler we now enforce a job timeout of 500ms. The previous logic,
which allowed a job to run indefinitely as long as it made progress,
is gone, as that is a serious QoS issue.

This might bite you at this point, if Xorg manages to submit a really
big job. The coredump might be delayed enough that it captures the
state of the GPU after it has managed to finish the job, even though
the job timeout was already hit.

Can you check whether changing the timeout value passed to
drm_sched_init() in etnaviv_sched.c to something large makes any
difference?

Regards,
Lucas
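
As an aside, for anyone who wants to double check the quoted ring dump by hand: the sketch below decodes the first word of each FE command the way the dump annotates it. The field layout is only inferred from the values in the dump above (opcode in the top 5 bits, a 16-bit operand in the low bits, LDST state address = operand * 4). The names fe_opcode(), fe_decode() and the enum are made up for this example and are not the etnaviv driver's identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Opcode values inferred from the dump; names are hypothetical. */
enum fe_op {
	FE_OP_LOAD_STATE = 0x01,	/* 0x08xxxxxx */
	FE_OP_WAIT       = 0x07,	/* 0x38xxxxxx */
	FE_OP_LINK       = 0x08,	/* 0x40xxxxxx */
	FE_OP_STALL      = 0x09,	/* 0x48xxxxxx */
};

/* The opcode sits in the top 5 bits of command word 0. */
unsigned int fe_opcode(uint32_t w0)
{
	return w0 >> 27;
}

/* Format one two-word command roughly like the dump annotation. */
void fe_decode(uint32_t w0, uint32_t w1, char *buf, size_t len)
{
	unsigned int low16 = w0 & 0xffff;

	switch (fe_opcode(w0)) {
	case FE_OP_LOAD_STATE:
		/* low 16 bits are a word offset, so byte address = offset * 4 */
		snprintf(buf, len, "LDST 0x%04x=0x%08x", low16 << 2,
			 (unsigned int)w1);
		break;
	case FE_OP_WAIT:
		/* low 16 bits are the wait count */
		snprintf(buf, len, "WAIT %u", low16);
		break;
	case FE_OP_LINK:
		/* word 1 is the address the FE jumps to */
		snprintf(buf, len, "LINK 0x%08x", (unsigned int)w1);
		break;
	case FE_OP_STALL:
		/* word 1 is the semaphore token, printed raw here */
		snprintf(buf, len, "STALL 0x%08x", (unsigned int)w1);
		break;
	default:
		snprintf(buf, len, "UNKNOWN op=%u", fe_opcode(w0));
		break;
	}
}

/* Compare against three lines taken from the quoted ring dump. */
void fe_selfcheck(void)
{
	char buf[64];

	fe_decode(0x08010e01u, 0x00000040u, buf, sizeof(buf));
	assert(strcmp(buf, "LDST 0x3804=0x00000040") == 0);

	fe_decode(0x380000c8u, 0x0000001fu, buf, sizeof(buf));
	assert(strcmp(buf, "WAIT 200") == 0);

	fe_decode(0x40000002u, 0x102d06d0u, buf, sizeof(buf));
	assert(strcmp(buf, "LINK 0x102d06d0") == 0);
}
```

Calling fe_selfcheck() from a small main() reproduces the annotations for the WAIT at 006d0 and the LINK at 006d8, which is the wait-link loop Russell describes.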