On Sat, Dec 1, 2012 at 11:32 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote: > On Sat, 1 Dec 2012 21:35:21 +0100, Daniel Vetter <daniel at ffwll.ch> wrote: >> On Sat, Dec 01, 2012 at 05:48:50PM +0000, Chris Wilson wrote: >> > Before queuing the flip but crucially after attaching the unpin-work to >> > the crtc, we continue to setup the unpin-work. However, should the >> > hardware fire early, we see the connected unpin-work and queue the task. >> > The task then promptly runs and unpins the fb before we finish taking >> > the required references or even pinning it... Havoc. >> > >> > To close the race, we use the flip-pending atomic to indicate when the >> > flip is finally setup and enqueued. So during the flip-done processing, >> > we can check more accurately whether the flip was expected. >> > >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk> >> >> Hm, can't this logic race? >> >> - emit the MI_FLIP >> >> - flip irq happens because the gpu is idle and completes it right away >> (or our thread is preempted), work->pending increments from 0 -> 1 >> >> - queue_flip sets work->pending to 1 > > -> write RING_TAIL, flush the commands to CS, begin execution of MI_FLIP Yeah, that should be the normal course of events where the MI_FLIP gets executed after we set work->pending to 1 (and after all the stuff has been done). The race I see is that the real MI_FLIP (not a spurious one this patch defends against) happens before we set work->pending to 1, so that we essentially lose the increment to 2 and so block any further flips on this crtc (or modesets for the matter, once the finish_fb stuff is fixed) indefinitely. Iow I think it's a bit too good at preventing unpins ;-) > I'm not happy with the explanation, but I could reliably (100%) hit the > race whilst loading a 2+GiB image using eog under compiz on an 965gm > with only 2GIB of ram. As soon as it hit kswapd, the system would OOPS > with an unpin leak. Which means that was a flip pending/done prior to > the pinning + MI_FLIP. This patch adds a strong defence against that > spurious flip done, but doesn't explain where it came from. Hm, I have no idea how that could cause the spurious flip - the most likely cause is that something introduces a nice delay somewhere (through kswapd), but I don't really see how that can happen. I guess I need to write a flip vs. swapping test. Was the swap due to unrelated memory pressue, or due to our own gem objects? -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch