Re: Funky new vblank counter regressions in Linux 4.4-rc1

Mario Kleiner <mario.kleiner.de@xxxxxxxxx> · Wed, 25 Nov 2015 20:04:26 +0100

On 11/25/2015 06:46 PM, Ville Syrjälä wrote:
On Wed, Nov 25, 2015 at 06:24:13PM +0100, Mario Kleiner wrote:
On 11/23/2015 09:24 PM, Ville Syrjälä wrote:
On Mon, Nov 23, 2015 at 06:58:34PM +0100, Mario Kleiner wrote:


On 11/23/2015 04:51 PM, Ville Syrjälä wrote:
On Mon, Nov 23, 2015 at 04:23:21PM +0100, Mario Kleiner wrote:
On 11/20/2015 04:34 PM, Ville Syrjälä wrote:
On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:

...
Ok, but why would that be a bad thing? I think we want it to think it is
in the previous frame if it is called outside the vblank irq context.
The only reason we fudge it to the next frames vblank if i vblank irq is
because we know the vblank irq handler we are executing atm. was meant
to execute within the upcoming vblank for the next frame, so we fudge
the scanout positions and thereby timestamp to correspond to that new
frame. But if something called outside irq context it should get a
scanout position/timestamp that corresponds to "reality".

It would be a bad thing since it would cause the timestamp to jump
backwards, and that would also cause the frame count guesstimate to go
backwards.


But only if we don't use the dev->driver->get_vblank_counter() method,
which we try to use on AMD.

Well, if you do it that way then you have the problem of the hw counter
seeming to jump forward by one after crossing the start of vblank (when
compared to the value you sampled when you processed the early vblank
interrupt).


Ok, finally i see the bad scenario that wouldn't get prevented by our
current locking with the new vblank counting in the core. The vblank
enable path is safe due to locking and discounting of redundant
timestamps etc. But the disable path could go wrong:

1. Vblank irq fires, drm_handle_vblank() -> drm_update_vblank_count(),
updates timestamps and counts "as if" in vblank -> incremented vblank
count and timestamp now set in the future.

2. After vblank irq finishes, but just before leading edge of vblank,
vblank_disable_and_save() executes, doesn't get bumped timestamp or
count because before vblank and not in vblank irq. Now
drm_update_vblank_count() would process a
"new" timestamp and count from the past and we'd have time and counts
going backwards, and bad things would happen.

I haven't observed such a thing happening during testing so far,
probably because the time window in which it could happen is tiny, but
given how awfully bad it would be, it needs to be prevented.

I had a look at the description of the Vblank irq in the "M76 Register
Reference Guide" for older asics and the description suggests that the
vblank irq fires when the crtc's line buffer is finished reading pixel
data from the scanout buffer in memory for a frame, ie., when the line
buffer read "enters" vblank. That would explain why the irq happens a
few scanlines before actual vblank, because line buffer refills must
obviously happen before the crtc can send pixel data from the line
buffer to the encoders, so it would lead a bit in time. That also means
we can't delay the vblank irq to actually happen at start of vblank and
have to deal with the early vblank irq.

I guess one silly idea would be to defer the vblank interrupt processing
to a timer, and just schedule it a bit into the future from the actual
interrupt handler.


Timer would be bad because then we get problems with the pageflip
completion irq sometimes being processed before the vblank irq,

You you'd need to move page flip completion to happen from the vblank
timer too I suppose.

and
because we want to be fast in vblank irq handling, delivering vblank
events etc. I wouldn't trust a timer to be reliable enough for such
short waits.

hrtimers should be accurate. But maybe more expensive than the timer
wheel.


Sounds all a bit complex and fraught with new possible complications. I 
can't spend much more time on this than i did so far, and we need to get 
this into 4.4-rc, so i am aiming for the most simple solution that would 
work.

Busy waiting wouldn't be great either in irq.

So what about this relatively simple one?

1. In radeon_get_crtc_scanoutpos() we artifically define the
vblank_start line to be, e.g, 5 scanlines before the true start of
vblank, so for the purpose of vblank counter queries and timestamping
our "vblank" would start a bit earlier and the vblank irq would always
execute in "vblank". Non-Irq invocations like vblank_disable_and_save()
would also be treated to this early vblank start and so what the DRM
core observes would always be consistent.

2. At least in radeon-kms we also use radeon_get_crtc_scanoutpos()
internally for "dynpm" dynamic power management/reclocking, and to
implement pageflip completion detection on asics older than DCE3 which
don't have pageflip interrupts. For those cases we need to use the true
start of vblank, so for this internal use we pass in some special flag
into radeon_get_crtc_scanoutpos() to tell it to not shift the vblank
start around.

3. I've added another check to my patch for drm_update_vblank_count() to
catch timestamps going backwards and discounting such cases, for a bit
of extra robustness against driver trouble.

Any ideas why this would be a stupid idea? I'll try this now and just
hope meanwhile nobody finds a reason why this would be bad.

What I don't like is leaking any of this into the core code. All the
hacks should live in the hw specific code. We've managed to keep all
the i915 hacks hidden that way.

So if you would:
- fudge your get_scanout_position as you suggested
- _not_ expose the hw counter to the vblank core since it
   disagrees with the scanout position
- keep the internal get_scanout_position variant/flag
   purely internal

then I think I might like it. That way working hardware doesn't have to
be exposed to these hacks, and there's possible a bit less danger that it
all gets broken again next time the core needs some work.


Ok. Exposing the fudged hw counter shouldn't be a problem though, given 
that it would be fudged in a consistent way with the fudged scanout 
positions and to have incremented at time of drm_handle_vblank(). We'd 
bump the hw counter to the count of the new vblank at the same time when 
the scanout positions would start counting backwards from minus 
something to 0, showing how many vblank lines to go until start of 
scanout, etc. The only difference to reality would be that this 
simulated vblank would start 5 scanlines earlier than the true hw 
vblank, but i can't see how the core would care about that.


One problem with that approach could be that the vblank event and page
flip event would be delievered at different times if you keep using the
page flip interrupt, but I'm not sure that would be a problem. At least
they should both have the same timestamp and counter value.


That's the same now, and the timestamps and counts be the same. I'll 
check with my measurement equipment that the timestamps will be still 
accurate wrt. to reality.

Attached is my current patch i wanted to submit for the drm core's 
drm_update_vblank_count(). I think it's good to make the core somewhat 
robust against potential kms driver bugs or glitches. But if you 
wouldn't like that patch, there wouldn't be much of a point sending it 
out at all.

thanks,
-mario

>From 2d5d58a1c575ad002ce2cb643f395d0e4757d959 Mon Sep 17 00:00:00 2001
From: Mario Kleiner <mario.kleiner.de@xxxxxxxxx>
Date: Wed, 25 Nov 2015 18:48:31 +0100
Subject: [PATCH] drm/irq: Make drm_update_vblank_count() more robust.

The changes to drm_update_vblank_count() for Linux 4.4-rc
made the function more fragile wrt. some hw quirks. E.g.,
at dev->driver->enable_vblank(), AMD gpu's fire a spurious
redundant vblank irq shortly after enabling vblank irqs, not
locked to vblank. This causes a redundant call which needs
to be suppressed to avoid miscounting.

To increase robustness, shuffle things around a bit:

On drivers with high precision vblank timestamping always
evaluate the timestamp difference between current timestamp
and previously recorded timestamp to detect such redundant
invocations and no-op in that case.

Also detect and warn about timestamps going backwards to
catch potential kms driver bugs.

This patch is meant for Linux 4.4-rc and later.

Signed-off-by: Mario Kleiner <mario.kleiner.de@xxxxxxxxx>
---
 drivers/gpu/drm/drm_irq.c | 53 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index 819b8c1..8728c3c 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -172,9 +172,11 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 				    unsigned long flags)
 {
 	struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
-	u32 cur_vblank, diff;
+	u32 cur_vblank, diff = 0;
 	bool rc;
 	struct timeval t_vblank;
+	const struct timeval *t_old;
+	u64 diff_ns;
 	int count = DRM_TIMESTAMP_MAXRETRIES;
 	int framedur_ns = vblank->framedur_ns;
 
@@ -195,13 +197,15 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 		rc = drm_get_last_vbltimestamp(dev, pipe, &t_vblank, flags);
 	} while (cur_vblank != dev->driver->get_vblank_counter(dev, pipe) && --count > 0);
 
-	if (dev->max_vblank_count != 0) {
-		/* trust the hw counter when it's around */
-		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
-	} else if (rc && framedur_ns) {
-		const struct timeval *t_old;
-		u64 diff_ns;
-
+	/*
+	 * Always use vblank timestamping based method if supported to reject
+	 * redundant vblank irqs. E.g., AMD hardware needs this to not screw up
+	 * due to some irq handling quirk.
+	 *
+	 * This also sets the diff value for use as fallback below in case the
+	 * hw does not support a suitable hw vblank counter.
+	 */
+	if (rc && framedur_ns) {
 		t_old = &vblanktimestamp(dev, pipe, vblank->count);
 		diff_ns = timeval_to_ns(&t_vblank) - timeval_to_ns(t_old);
 
@@ -212,11 +216,36 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 		 */
 		diff = DIV_ROUND_CLOSEST_ULL(diff_ns, framedur_ns);
 
-		if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ)
+		/* Catch driver timestamping bugs and prevent worse. */
+		if ((s32) diff < 0) {
+			DRM_DEBUG_VBL("crtc %u: Timestamp going backward!"
+			" diff_ns = %lld, framedur_ns = %d)\n",
+			pipe, (long long) diff_ns, framedur_ns);
+			return;
+		}
+
+		if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ) {
 			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
-				      " diff_ns = %lld, framedur_ns = %d)\n",
-				      pipe, (long long) diff_ns, framedur_ns);
-	} else {
+			" diff_ns = %lld, framedur_ns = %d)\n",
+			pipe, (long long) diff_ns, framedur_ns);
+
+			/* Treat this redundant invocation as no-op. */
+			WARN_ON_ONCE(cur_vblank != vblank->last);
+			return;
+		}
+	}
+
+	/*
+	 * hw counters, if marked as supported via max_vblank_count != 0,
+	 * *must* increment at leading edge of vblank or at least before
+	 * the firing of the hw vblank irq, otherwise we have a race here
+	 * between gpu and us, where small variations in our execution
+	 * timing can cause off-by-one miscounting of vblanks.
+	 */
+	if (dev->max_vblank_count != 0) {
+		/* trust the hw counter when it's around */
+		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
+	} else if (!(rc && framedur_ns)) {
 		/* some kind of default for drivers w/o accurate vbl timestamping */
 		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
 	}
-- 
1.9.1

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel