Re: Funky new vblank counter regressions in Linux 4.4-rc1

Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx> · Fri, 20 Nov 2015 17:34:35 +0200

On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:
> 
> 
> On 11/19/2015 08:45 PM, Ville Syrjälä wrote:
> > On Thu, Nov 19, 2015 at 08:12:24PM +0100, Mario Kleiner wrote:
> >> On 11/19/2015 07:20 PM, Ville Syrjälä wrote:
> >>> On Thu, Nov 19, 2015 at 06:46:28PM +0100, Mario Kleiner wrote:
> >>>> Hi Alex and Michel and Ville,
> >>>>
> >>>> it's "fix vblank stuff" time again ;-)
> >>>>
> >>>> Ville's changes to the DRM's drm_handle_vblank() /
> >>>> drm_update_vblank_count() code in Linux 4.4 not only made that code more
> >>>> elegant, but also removed the robustness against the vblank irq quirks
> >>>> in AMD hw and similar hardware. So now i get tons of off-by-one errors and
> >>>>
> >>>> "[   432.345] (WW) RADEON(1): radeon_dri2_flip_event_handler: Pageflip
> >>>> completion event has impossible msc 24803 < target_msc 24804" XOrg
> >>>> messages from that kernel.
> >>>
> >>> Argh. Sorry about that.
> >>>
> >>
> >> On the plus side, your "vblank timestamp deltas as fake vblank counters"
> >> code seems to work nicely on nouveau-kms, as far as testing with three
> >> Nvidia's went so far :). And both Intel gpu's (HD Ironlake, and
> >> Ivybridge) i tested checked out nicely.
> >>
> >> And at least the recent nv50+ NVidia Tesla also have 16 bit vblank
> >> counters which we could implement in nouveau, maybe with the same
> >> trickery to allow long trouble-free vblank off periods, hopefully that
> >> would also apply to the Tegra-4 and later Kepler based parts. Tegra-3
> >> will probably also work. I think i read in the Tegra-3 PRM that the sync
> >> points they use to implement vblank counters do increment at leading
> >> edge of vblank.
> >>
> >> The only problem we may have is that some of the embedded gpus may not
> >> have compliant vblank counters and they probably also lack vblank
> >> timestamping, so it might be a good idea to rather not use vblank
> >> counters at all in those drivers - patch their kms drivers to
> >> max_vblank_count = 0;
> >>
> >>>>
> >>>> One of the reasons for trouble is that AMD hw quirk where the hw fires
> >>>> an extra vblank irq shortly after vblank irq's get enabled, not
> >>>> synchronized to vblank, but typically in the middle of active scanout,
> >>>> so we get a redundant call to drm_handle_vblank in the middle of scanout.
> >>>
> >>> I think that should be fine as such. The code should ignore redudntant
> >>> vbl irqs. Well, assuming you have a reliable hw counter or you use the
> >>> timestamp guesstimate mechanism and your scanout position is reported
> >>> accurately. But I guess you have a bit of problem with both.
> >>>
> >>
> >> The problem is i'll need to treat calls to radeon kms
> >> driver->get_vblank_counter differently, depending if the function gets
> >> called from vblank irq, or from regular code, so that hw quirk that
> >> causes spontaneous misfiring of the vblank irq in the middle of scanout
> >> would confuse my hw vblank counter cooking method to produce a fake hw
> >> vblank counter increment. That's why i moved the filtering for redundant
> >> irqs based on vblank timestamps in drm_vblank_update() around to always
> >> apply. Makes us robust against that type of hw quirk in general and
> >> makes life for the vblank counter cooking so much easier.
> >>
> >> It's a beautiful collaboration of different hw bugs to make things
> >> interesting :)
> >>
> >>>>
> >>>> To fix that i have a minor patch to make drm_update_vblank_count() again
> >>>> robust against such redundant calls, which i will send out later to the
> >>>> mailing list. Diff attached for reference.
> >>>>
> >>>> The second quirk of AMD hw is that the vblank interrupt fires a few
> >>>> scanlines before start of vblank, so drm_handle_vblank ->
> >>>> drm_update_vblank_count() -> dev->driver->get_vblank_counter() gets
> >>>> called before the start of the vblank for which the new vblank count
> >>>> should be queried.
> >>>
> >>> Does it fire too soon, or is the scanout position register value(s)
> >>> just offset by a few lines perhaps?
> >>>
> >>> We have that with i915 and I simply fix up the value when reading it
> >>> out. Fortunately for us the offset is constant (or at least seems to
> >>> be) for a given platform/connector combo.
> >>>
> >>
> >> I think they fire too soon, from all i've seen so far on a few cards.
> >
> > That's unfortunate. Firing a bit too late would be perfectly fine for
> > most things. And that's actually what happens on older intel hw. Firing
> > too soon opens up some more races, as in you may have to wait a bit
> > more after getting woken up to make sure you've crossed into the
> > freame you were waiting for.
> >
> > Also not sure how to deal with that sort of thing in a reasonable way in
> > .get_vblank_timestamp(). DRM_CALLED_FROM_VBLIRQ gets passed there, so it
> > could fudge things a bit when the early irq arrives, but then if someone
> > calls drm_update_vblank_count() after the irq was handled but before
> > start of vblank you'll end up with a -1 diff.
> >
> > Maybe something like:
> > .get_scanout_positon()
> > {
> > 	read frame counter and scaline position
> >
> > 	if ((flags & DRM_CALLED_FROM_VBLIRQ || frame == last_fudge_frame) &&
> > 	     scanline_slightly_below_start_of_vblank)) {
> > 		last_fudge_frame = frame;
> > 		scanline = start_of_vblank;
> > 	} else
> > 		last_fudge_frame = frame - 1;
> > }
> >
> > could be done. So it would pretend the scanline has already crossed into
> > the vblank when the interrupt arrives, and it should keep doing that
> > until the frame counter indicates that you really crossed there.
> >
> 
> What we do in radeon-kms is similar. If DRM_CALLED_FROM_VBLIRQ and we 
> are no more than 1% of the display height away from start of vblank we 
> fudge scanout position in a way so that the timestamp gets computed for 
> the soon-to-begin vblank, not the old one.

The problem with basing that fudging purely on DRM_CALLED_FROM_VBLIRQ is
that if you call .get_scanout_positon() from a non-irq context between
the irq firing and start of vblank, you'll think you're still in the
previous frame. Hence my suggestion to note down the frame counter when
called from the irq, and then keep doing the fudging until you've truly
crossed the start of vblank.

> For the vblank counter 
> cooking atm. i check for in_irq(), assuming that means the same as 
> DRM_CALLED_FROM_VBLIRQ, given that is currently the only call from a hw 
> irq path, and then i test scanout position for 
> no-more-than-10-scanlines-to-vblank, to decide if i have to bump the 
> returned vblank count. On the sample of gpus i tested the vblank irq 
> fired somewhere between 2 and 4 scanlines too early, so 10 scanlines 
> margin should be ok. So it's not firing much too early, just a little bit.
> 
> >>
> >>>>
> >>>> The third problem is that the DRM vblank handling always had the
> >>>> assumption that hardware vblank counters would essentially increment at
> >>>> leading edge of vblank - basically in sync with the firing of the vblank
> >>>> irq, so that a hw counter readout from within the vblank irq handler
> >>>> would always deliver the new incremented value. If this assumption is
> >>>> violated then the counting by use of the hw counter gets unreliable,
> >>>> because depending on random small delays in irq handling the code may
> >>>> end up sampling the hw counter pre- or post-increment, leading to
> >>>> inconsistent updating and funky bugs. It just so happens that AMD
> >>>> hardware doesn't increment the hw counter at leading edge of vblank, so
> >>>> stuff falls apart.
> >>>>
> >>>> So to fix those two problems i'm tinkering with cooking the hw vblank
> >>>> counter value returned by radeon_get_vblank_counter_kms() to make it
> >>>> appear as if the counter incremented at leading edge of vblank in sync
> >>>> with vblank irq.
> >>>
> >>> Yeah, I do that in i915 too. Well, on some platforms where the
> >>> counter increments at the leading edge of active.
> >>>
> >>
> >> Yep. I think that's key to make the vblank instant off stuff work reliably.
> >
> > I think it should work OK with the timestamp based stuff too, assuming
> > you have accurate scanout position reporting. So having a reliable hw
> > counter shouldn't be a requirement for that.
> >
> 
> Yes, should work for the instant off/on itself. But the limitation of 
> that is that with the timestamp based stuff we can only be accurate for 
> vblank off periods of a few seconds into the future. The framedur_ns we 
> calculate from hw mode and the refresh duration computed from the 
> timestamps as measured by the host clock can easily deviate by, e.g., 10 
> usecs due to difference and drift between the gpu pll and host clock, so 
> for each passing refresh cycle with vblank off we'd accumulate ~10 usecs 
> of error. After accumulating half a refresh duration of error we'd round 
> to the wrong count and be off by one. So at 60 Hz refresh we'd have 
> about 16666 usecs / 2 = 8333 usecs headroom / 10 usecs per frame error ~ 
> 833 refresh cycles before errors happen. At 60 Hz we'd have 833 / 60 ~ 
> 14 seconds of vblank off. For rates of 120 Hz gamer panels or some 200 
> Hz neuro-science applications we'd only get a few seconds off error free 
> vblank off without real hardware counters.
> 
> So it's good enough for typical desktop 
> applications/compositors/games/media players, and a nice improvement 
> over the previous state, but not quite sufficient for applications that 
> need long time consistent vblank counts for long waits of multiple 
> seconds or even minutes. So it's still good to have hw counters if we 
> can get some that are reliable enough.

Ah, I didn't realize you care about small errors in the counter for
such long periods of vblank off.

> 
> -mario
> 
> 
> >>
> >> -mario
> >>
> >>>>
> >>>> It almost sort of works on the rs600 code path, but i need a bit of info
> >>>> from you:
> >>>>
> >>>> 1. There's this register from the old specs for m76.pdf, which is not
> >>>> part of the current register defines for radeon-kms:
> >>>>
> >>>> "D1CRTC_STATUS_VF_COUNT - RW - 32 bits - [GpuF0MMReg:0x60A8]"
> >>>>
> >>>> It contains the lower 16 bits of framecounter and the 13 bits of
> >>>> vertical scanout position. It seems to give the same readings as the 24
> >>>> bit R_0060A4_D1CRTC_STATUS_FRAME_COUNT we use for the hw counter. This
> >>>> would come handy.
> >>>>
> >>>> Does Evergreen and later have a same/similar register and where is it?
> >>>>
> >>>> 2. The hw framecounter seems to increment when the vertical scanout
> >>>> position wraps back from (VTOTAL - 1) to 0, at least on the one DCE-3
> >>>> gpu i tested so far. Is this so on all asics? And is the hw counter
> >>>> increment happening exactly at the moment that vertical scanout position
> >>>> jumps back to zero, ie. both events are driven by the same signal? Or is
> >>>> the framecounter increment just happening somewhere inside either
> >>>> scanline VTOTAL-1 or scanline 0?
> >>>>
> >>>>
> >>>> If we can fix this and get it into rc2 or rc3 then we could avoid a bad
> >>>> regression and with a bit of luck at the same time improve by being able
> >>>> to set dev->vblank_disable_immediate = true then and allow vblank irqs
> >>>> to get turned off more aggressively for a bit of extra power saving.
> >>>>
> >>>> thanks,
> >>>> -mario
> >>>
> >>>> -- a/drivers/gpu/drm/drm_irq.c
> >>>> +++ b/drivers/gpu/drm/drm_irq.c
> >>>> @@ -172,9 +172,11 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
> >>>>                                       unsigned long flags)
> >>>>    {
> >>>>           struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
> >>>> -       u32 cur_vblank, diff;
> >>>> +       u32 cur_vblank, diff = 0;
> >>>>           bool rc;
> >>>>           struct timeval t_vblank;
> >>>> +       const struct timeval *t_old;
> >>>> +       u64 diff_ns;
> >>>>           int count = DRM_TIMESTAMP_MAXRETRIES;
> >>>>           int framedur_ns = vblank->framedur_ns;
> >>>>
> >>>> @@ -195,13 +197,15 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
> >>>>                   rc = drm_get_last_vbltimestamp(dev, pipe, &t_vblank, flags);
> >>>>           } while (cur_vblank != dev->driver->get_vblank_counter(dev, pipe) && --count > 0);
> >>>>
> >>>> -       if (dev->max_vblank_count != 0) {
> >>>> -               /* trust the hw counter when it's around */
> >>>> -               diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
> >>>> -       } else if (rc && framedur_ns) {
> >>>> -               const struct timeval *t_old;
> >>>> -               u64 diff_ns;
> >>>> -
> >>>> +       /*
> >>>> +        * Always use vblank timestamping based method if supported to reject
> >>>> +        * redundant vblank irqs. E.g., AMD hardware needs this to not screw up
> >>>> +        * due to some irq handling quirk.
> >>>
> >>> Hmm. I'm thinking it would be better to simply not claim that there's a hw
> >>> counter if it isn't reliable.
> >>>
> >>>> +        *
> >>>> +        * This also sets the diff value for use as fallback below in case the
> >>>> +        * hw does not support a suitable hw vblank counter.
> >>>> +        */
> >>>> +       if (rc && framedur_ns) {
> >>>>                   t_old = &vblanktimestamp(dev, pipe, vblank->count);
> >>>>                   diff_ns = timeval_to_ns(&t_vblank) - timeval_to_ns(t_old);
> >>>>
> >>>> @@ -212,11 +216,28 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
> >>>>                    */
> >>>>                   diff = DIV_ROUND_CLOSEST_ULL(diff_ns, framedur_ns);
> >>>>
> >>>> -               if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ)
> >>>> -                       DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
> >>>> -                                     " diff_ns = %lld, framedur_ns = %d)\n",
> >>>> -                                     pipe, (long long) diff_ns, framedur_ns);
> >>>> -       } else {
> >>>> +               if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ) {
> >>>> +                   DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
> >>>> +                   " diff_ns = %lld, framedur_ns = %d)\n",
> >>>> +                   pipe, (long long) diff_ns, framedur_ns);
> >>>> +
> >>>> +                   /* Treat this redundant invocation as no-op. */
> >>>> +                   WARN_ON_ONCE(cur_vblank != vblank->last);
> >>>> +                   return;
> >>>> +               }
> >>>> +       }
> >>>> +
> >>>> +       /*
> >>>> +        * hw counters, if marked as supported via max_vblank_count != 0,
> >>>> +        * *must* increment at leading edge of vblank and in sync with
> >>>> +        * the firing of the hw vblank irq, otherwise we have a race here
> >>>> +        * between gpu and us, where small variations in our execution
> >>>> +        * timing can cause off-by-one miscounting of vblanks.
> >>>> +        */
> >>>> +       if (dev->max_vblank_count != 0) {
> >>>> +               /* trust the hw counter when it's around */
> >>>> +               diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
> >>>> +       } else if (!(rc && framedur_ns)) {
> >>>>                   /* some kind of default for drivers w/o accurate vbl timestamping */
> >>>>                   diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
> >>>>           }
> >>>
> >>>
> >

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel