[PATCH] drm/i915: Close race between processing unpin task and queueing the flip

daniel at ffwll.ch (Daniel Vetter) · Sun, 2 Dec 2012 02:15:23 +0100

On Sat, Dec 1, 2012 at 11:32 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Sat, 1 Dec 2012 21:35:21 +0100, Daniel Vetter <daniel at ffwll.ch> wrote:
>> On Sat, Dec 01, 2012 at 05:48:50PM +0000, Chris Wilson wrote:
>> > Before queuing the flip but crucially after attaching the unpin-work to
>> > the crtc, we continue to setup the unpin-work. However, should the
>> > hardware fire early, we see the connected unpin-work and queue the task.
>> > The task then promptly runs and unpins the fb before we finish taking
>> > the required references or even pinning it... Havoc.
>> >
>> > To close the race, we use the flip-pending atomic to indicate when the
>> > flip is finally setup and enqueued. So during the flip-done processing,
>> > we can check more accurately whether the flip was expected.
>> >
>> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>
>> Hm, can't this logic race?
>>
>> - emit the MI_FLIP
>>
>> - flip irq happens because the gpu is idle and completes it right away
>> (or our thread is preempted), work->pending increments from 0 -> 1
>>
>> - queue_flip sets work->pending to 1
>
> -> write RING_TAIL, flush the commands to CS, begin execution of MI_FLIP

Yeah, that should be the normal course of events where the MI_FLIP
gets executed after we set work->pending to 1 (and after all the stuff
has been done). The race I see is that the real MI_FLIP (not a
spurious one this patch defends against) happens before we set
work->pending to 1, so that we essentially lose the increment to 2 and
so block any further flips on this crtc (or modesets for the matter,
once the finish_fb stuff is fixed) indefinitely.

Iow I think it's a bit too good at preventing unpins ;-)

> I'm not happy with the explanation, but I could reliably (100%) hit the
> race whilst loading a 2+GiB image using eog under compiz on an 965gm
> with only 2GIB of ram. As soon as it hit kswapd, the system would OOPS
> with an unpin leak. Which means that was a flip pending/done prior to
> the pinning + MI_FLIP. This patch adds a strong defence against that
> spurious flip done, but doesn't explain where it came from.

Hm, I have no idea how that could cause the spurious flip - the most
likely cause is that something introduces a nice delay somewhere
(through kswapd), but I don't really see how that can happen. I guess
I need to write a flip vs. swapping test. Was the swap due to
unrelated memory pressue, or due to our own gem objects?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch