Re: Funky new vblank counter regressions in Linux 4.4-rc1

Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx> · Wed, 25 Nov 2015 19:46:00 +0200

On Wed, Nov 25, 2015 at 06:24:13PM +0100, Mario Kleiner wrote:
> On 11/23/2015 09:24 PM, Ville Syrjälä wrote:
> > On Mon, Nov 23, 2015 at 06:58:34PM +0100, Mario Kleiner wrote:
> >>
> >>
> >> On 11/23/2015 04:51 PM, Ville Syrjälä wrote:
> >>> On Mon, Nov 23, 2015 at 04:23:21PM +0100, Mario Kleiner wrote:
> >>>> On 11/20/2015 04:34 PM, Ville Syrjälä wrote:
> >>>>> On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:
> >>>>
> >>>> ...
> >>>> Ok, but why would that be a bad thing? I think we want it to think it is
> >>>> in the previous frame if it is called outside the vblank irq context.
> >>>> The only reason we fudge it to the next frames vblank if i vblank irq is
> >>>> because we know the vblank irq handler we are executing atm. was meant
> >>>> to execute within the upcoming vblank for the next frame, so we fudge
> >>>> the scanout positions and thereby timestamp to correspond to that new
> >>>> frame. But if something called outside irq context it should get a
> >>>> scanout position/timestamp that corresponds to "reality".
> >>>
> >>> It would be a bad thing since it would cause the timestamp to jump
> >>> backwards, and that would also cause the frame count guesstimate to go
> >>> backwards.
> >>>
> >>
> >> But only if we don't use the dev->driver->get_vblank_counter() method,
> >> which we try to use on AMD.
> >
> > Well, if you do it that way then you have the problem of the hw counter
> > seeming to jump forward by one after crossing the start of vblank (when
> > compared to the value you sampled when you processed the early vblank
> > interrupt).
> >
> 
> Ok, finally i see the bad scenario that wouldn't get prevented by our 
> current locking with the new vblank counting in the core. The vblank 
> enable path is safe due to locking and discounting of redundant 
> timestamps etc. But the disable path could go wrong:
> 
> 1. Vblank irq fires, drm_handle_vblank() -> drm_update_vblank_count(),
> updates timestamps and counts "as if" in vblank -> incremented vblank 
> count and timestamp now set in the future.
> 
> 2. After vblank irq finishes, but just before leading edge of vblank, 
> vblank_disable_and_save() executes, doesn't get bumped timestamp or 
> count because before vblank and not in vblank irq. Now 
> drm_update_vblank_count() would process a
> "new" timestamp and count from the past and we'd have time and counts 
> going backwards, and bad things would happen.
> 
> I haven't observed such a thing happening during testing so far, 
> probably because the time window in which it could happen is tiny, but 
> given how awfully bad it would be, it needs to be prevented.
> 
> I had a look at the description of the Vblank irq in the "M76 Register 
> Reference Guide" for older asics and the description suggests that the 
> vblank irq fires when the crtc's line buffer is finished reading pixel 
> data from the scanout buffer in memory for a frame, ie., when the line 
> buffer read "enters" vblank. That would explain why the irq happens a 
> few scanlines before actual vblank, because line buffer refills must 
> obviously happen before the crtc can send pixel data from the line 
> buffer to the encoders, so it would lead a bit in time. That also means 
> we can't delay the vblank irq to actually happen at start of vblank and 
> have to deal with the early vblank irq.
> 
> > I guess one silly idea would be to defer the vblank interrupt processing
> > to a timer, and just schedule it a bit into the future from the actual
> > interrupt handler.
> >
> 
> Timer would be bad because then we get problems with the pageflip 
> completion irq sometimes being processed before the vblank irq,

You you'd need to move page flip completion to happen from the vblank
timer too I suppose.

> and 
> because we want to be fast in vblank irq handling, delivering vblank 
> events etc. I wouldn't trust a timer to be reliable enough for such 
> short waits.

hrtimers should be accurate. But maybe more expensive than the timer
wheel.

> Busy waiting wouldn't be great either in irq.
> 
> So what about this relatively simple one?
> 
> 1. In radeon_get_crtc_scanoutpos() we artifically define the 
> vblank_start line to be, e.g, 5 scanlines before the true start of 
> vblank, so for the purpose of vblank counter queries and timestamping 
> our "vblank" would start a bit earlier and the vblank irq would always 
> execute in "vblank". Non-Irq invocations like vblank_disable_and_save() 
> would also be treated to this early vblank start and so what the DRM 
> core observes would always be consistent.
> 
> 2. At least in radeon-kms we also use radeon_get_crtc_scanoutpos() 
> internally for "dynpm" dynamic power management/reclocking, and to 
> implement pageflip completion detection on asics older than DCE3 which 
> don't have pageflip interrupts. For those cases we need to use the true 
> start of vblank, so for this internal use we pass in some special flag 
> into radeon_get_crtc_scanoutpos() to tell it to not shift the vblank 
> start around.
> 
> 3. I've added another check to my patch for drm_update_vblank_count() to 
> catch timestamps going backwards and discounting such cases, for a bit 
> of extra robustness against driver trouble.
> 
> Any ideas why this would be a stupid idea? I'll try this now and just 
> hope meanwhile nobody finds a reason why this would be bad.

What I don't like is leaking any of this into the core code. All the
hacks should live in the hw specific code. We've managed to keep all
the i915 hacks hidden that way.

So if you would:
- fudge your get_scanout_position as you suggested
- _not_ expose the hw counter to the vblank core since it
  disagrees with the scanout position
- keep the internal get_scanout_position variant/flag
  purely internal

then I think I might like it. That way working hardware doesn't have to
be exposed to these hacks, and there's possible a bit less danger that it
all gets broken again next time the core needs some work.

One problem with that approach could be that the vblank event and page
flip event would be delievered at different times if you keep using the
page flip interrupt, but I'm not sure that would be a problem. At least
they should both have the same timestamp and counter value.

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel