On Tue, Jun 19, 2018 at 12:09:16PM +0200, Lucas Stach wrote:
> Hi Russell,
>
> On Tuesday, 19 June 2018 at 10:43 +0100, Russell King - ARM Linux wrote:
> > It looks like a bug has crept in to etnaviv between 4.16 and 4.17,
> > which causes etnaviv to misbehave with the GC600 GPU on Dove. I
> > don't think it's a GPU issue, I think it's a DRM issue.
> >
> > I get multiple:
> >
> > [ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
> > [ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
> >
> > while Xorg is starting up. Ignore the "failed to reset", that
> > just seems to be a property of the GC600, and of course is a
> > subsequent issue after the primary problem.
> >
> > Looking at the devcoredump:
> >
> > 00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
> >
> > So, all units on the GC600 were idle except for the front end.
> >
> > 00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
> > 00000664 = 102d06d8 Command DMA address
> > 00000668 = 380000c8 FE fetched word 0
> > 0000066c = 0000001f FE fetched word 1
> >
> > The front end was basically idle at this point, at a WAIT 200 command.
> > Digging through the ring:
> >
> > 00688: 08010e01 00000040 LDST 0x3804=0x00000040
> > 00690: 40000002 102d06a0 LINK 0x102d06a0
> > 00698: 40000002 102d0690 LINK 0x102d0690
> > 006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
> > 006a8: 40000025 102d3000 LINK 0x102d3000
> > 006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
> > 006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
> > 006c0: 48000000 00000701 STALL FE -> PE
> > 006c8: 08010e01 00000041 LDST 0x3804=0x00000041
> > 006d0: 380000c8(0000001f) WAIT 200
> > > 006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
> >
> > We've basically come to the end of the currently issued command stream
> > and hit the wait-link loop. Everything else in the devcoredump looks
> > normal.
> >
> > So, I think etnaviv DRM has missed an event signalled from the GPU.
>
> I don't see what would make us miss an event suddenly.
>
> > This worked fine in 4.16, so seems to be a regression.
>
> The only thing that comes to mind is that with the DRM scheduler we
> enforce a job timeout of 500ms, without the previous logic to allow a
> job to run indefinitely as long as it makes progress, as this is a
> serious QoS issue.

That is probably what's going on then - the GC600 is not particularly
fast when dealing with 1080p resolutions.

I think what your commit to use the DRM scheduler is missing is the
progress detection in the original scheme - we used to assume that if
the GPU FE DMA address had progressed, the GPU was not hung. Now it
seems we merely do this by checking for events.

> This might bite you at this point, if Xorg manages to submit a really
> big job. The coredump might be delayed enough that it captures the
> state of the GPU when it has managed to finish the job after the job
> timeout was hit.

No, it's not "a really big job" - it's just that the Dove GC600 is not
fast enough to complete _two_ 1080p sized GPU operations within 500ms.
The preceding job contained two blits - one of them a non-alphablend
copy of:

  00180000 04200780  0,24,1920,1056 -> 0,24,1920,1056

and one an alpha blended copy of:

  00000000 04380780  0,0,1920,1080 -> 0,0,1920,1080

This is (iirc) something I already fixed with the addition of the
progress detection back before etnaviv was merged into the mainline
kernel.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
_______________________________________________
dri-devel mailing list
https://lists.freedesktop.org/mailman/listinfo/dri-devel
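
For reference, the progress detection discussed in this message (treat a
moving FE DMA address as a sign that the GPU is not hung, and only
recover when it stops moving) can be modelled roughly as below. This is
a Python sketch of the idea only, not the etnaviv code: HangCheck,
on_timeout and IDLE_LOOP_WINDOW are hypothetical names, and the 16-byte
window is an assumption based on the dump above, where the WAIT at 006d0
and the LINK at 006d8 sit 8 bytes apart, so an FE bouncing between them
should not count as progress.

```python
from dataclasses import dataclass
from typing import Optional

JOB_TIMEOUT_MS = 500    # scheduler job timeout quoted in the thread
IDLE_LOOP_WINDOW = 16   # assumption: WAIT + LINK are two 8-byte commands


@dataclass
class HangCheck:
    """Hypothetical timeout-handler state; not the etnaviv structure."""
    last_dma_addr: Optional[int] = None

    def on_timeout(self, dma_addr: int, fence_signalled: bool) -> str:
        """Decide what to do when a job's timeout fires.

        Returns "spurious" if the job already completed, "extend" if the
        front end moved since the last check (re-arm the timer for another
        JOB_TIMEOUT_MS), or "recover" if it did not move.
        """
        if fence_signalled:
            return "spurious"

        if self.last_dma_addr is None:
            moved = True  # first check: nothing to compare against yet
        else:
            delta = dma_addr - self.last_dma_addr
            # Small forward steps are treated as the wait-link idle loop.
            moved = delta < 0 or delta > IDLE_LOOP_WINDOW

        if moved:
            self.last_dma_addr = dma_addr
            return "extend"
        return "recover"
```

With this model, a front end whose DMA address moves from 0x102d3000 to
0x102d06d0 between two timeouts gets its timeout extended, while one
that is later sampled at 0x102d06d8 (8 bytes further on, inside the
wait-link loop) with the job's fence still unsignalled is recovered.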