Re: [RFC] Reduce idle vblank wakeups

On Nov 17, 2011, at 3:19 AM, Matthew Garrett wrote:

On Thu, Nov 17, 2011 at 01:26:37AM +0100, Mario Kleiner wrote:
On Nov 16, 2011, at 7:48 PM, Matthew Garrett wrote:
I'll admit that I'm struggling to understand the issue here. If the
vblank counter is incremented at the time of vblank (which isn't the
case for radeon, it seems, but as far as I can tell is the case for
Intel) then how does ping-ponging the IRQ matter?
vblank_disable_and_save() appears to handle this case.


The drm software vblank counter which is used for scheduling swaps
etc. gets incremented in the gpu's vblank irq handler. The gpu's
hardware counter gets incremented somewhere inside the vblank
interval. The increments don't happen at the same point in time. The
race is between the vblank on/off code, which gets scheduled either
by the timeout timer for the "off" case, or by usercode for the "on"
case and the gpu's hardware vblank counter.

Yes, that makes sense given the current design.

The existing code uses a lock (vblank_time_lock) to prevent some
races between itself and the vblank irq handler, but it can't "lock"
the gpu, so if the enable/disable code executes while the gpu is
scanning out the vblank interval, it is undefined whether the final
vblank count will be correct or off by one. Vblank duration is
somewhere up to 4.5% of refresh interval duration, so there's your
up-to-4.5% chance of getting it wrong.

Well presumably by "undefined" we really mean "hardware-specific" - for
any given well-functioning hardware I'd expect the answer to be
well-defined. Like I said, with Radeon we have the option of triggering
an interrupt at the point where the hardware counter is actually
incremented. If that then shuts down the vblank interrupt then we should
have a well-defined answer.


Yes, "hardware-specific". It is only "undefined" from the point of view of the drm core in the sense that the code in the core is currently agnostic to what the underlying gpu/driver does.

...


Use the crtc scanout position queries together with the vblank
counter queries inside some calibration loop, maybe executed after
each modeset, to find out the scanline range in which the hardware
vblank counter increments -- basically a forbidden range of scanline
positions where the race would happen. Then at each vblank off/on,
query scanout position before and after the hw vblank counter query.
If according to the scanout positions the vblank counter query
happened within the forbidden time window, retry the query. With a
well working calibration that should add no delay in most cases and
a delay to the on/off code of a few dozen microseconds (=duration of
a few scanlines) worst case.
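A user-space sketch of that retry scheme, with a mocked scanout position and hardware counter (the scanline geometry, the increment line, and the margin are all made-up calibration results; a real driver would read the crtc scanout position and frame counter from the hardware under vblank_time_lock):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical geometry: 1125 total scanlines per frame; a calibration
 * loop has found the hw counter increments around scanline 1080, with
 * +/- 5 lines of uncertainty. All numbers are illustrative. */
enum { TOTAL_LINES = 1125, HW_INC_LINE = 1080, MARGIN = 5 };

static int beam_line; /* mock scanout position; the "gpu" advances it */

/* Mock position read: the beam advances one line per read, standing in
 * for the time each hardware query takes. */
static int read_scanout_position(void)
{
    beam_line = (beam_line + 1) % TOTAL_LINES;
    return beam_line;
}

/* Mock hardware frame counter: flips once the beam passes HW_INC_LINE. */
static unsigned read_hw_vblank_counter(void)
{
    return beam_line >= HW_INC_LINE ? 1 : 0;
}

static bool in_forbidden_window(int line)
{
    return line >= HW_INC_LINE - MARGIN && line <= HW_INC_LINE + MARGIN;
}

/* Sample the hw counter, retrying whenever the scanout position sampled
 * before or after the counter read falls inside the calibrated window
 * in which the counter may increment. On return, the counter value is
 * unambiguous with respect to the current frame. */
static unsigned sample_hw_counter_safely(void)
{
    int before, after;
    unsigned count;

    do {
        before = read_scanout_position();
        count = read_hw_vblank_counter();
        after = read_scanout_position();
    } while (in_forbidden_window(before) || in_forbidden_window(after));

    return count;
}
```

Worst case the loop spins for the width of the window (a few scanlines' worth of time, i.e., tens of microseconds); in the common case both position samples land outside the window on the first try and no delay is added.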

Assuming we're sleeping rather than busy-looping, that's certainly ok.
My previous experiments with radeon indicated that the scanout irq was
certainly not entirely reliable - on the other hand, I was trying to use
it for completing memory reclocking within the vblank interval. It was
typically still within a few scanlines, so a sanity check there wouldn't
pose too much of a problem.


Sleeping in the timer-triggered off path would be ok, but in the on path we need to be relatively fast, so I think some kind of cpu-friendly busy waiting (cpu_relax() or something like that, if I understand its purpose) would probably be necessary.


I've no problem with all of this work being case by case.

Me neither. I just say if you'd go to the extreme and disable vblank
irq's immediately with zero delay, without reliably fixing the
problem i mentioned, you'd get those off by one counts, which would
be very bad for apps that rely on precise timing. Doing so would
basically make the whole oml_sync_control implementation mostly
useless.

Right. My testing of sandybridge suggests that there wasn't a problem
here - even with the ping-ponging I was reliably getting 60 interrupts
in 60 seconds, with the counter incrementing by 1 each time. I certainly
wouldn't push to enable it elsewhere without making sure that the
results are reliable.


How hard did you test it? Given that the off-by-one is a race condition, there is a good chance you missed the bad cases due to lucky timing. I think one way to test the ping-pong case might be a tight loop with calls to glXGetSyncValuesOML(), or at a lower level a tight loop around the drmWaitVblank() ioctl, which is what glXGetSyncValuesOML() does.

That would be a sequence of drm_vblank_get() -> irq's on, reinitialize vblank counter and timestamps -> fetch values -> drm_vblank_put() -> irq's off, etc. If you run this with a couple of thousand calls per second there would be some coverage of the vblank period. Either that, or some funkiness because we wouldn't exercise the vblank irq anymore for vblank counting and might miss something there for some reason...

If vblanks are disabled and then re-enabled between 1 and 3, what's the negative outcome?

1. glXGetSyncValuesOML() gives you the current vblank count and a
timestamp of when exactly scanout of the corresponding video frame
started. These values are the baseline for any vblank based swap
scheduling or any other vblank synchronized actions, e.g., via
glXWaitForMscOML().

2. App calculates stuff based on 1. but gets preempted or
experiences scheduling delay on return from 1. - or the x-server
gets preempted or scheduled away. Whatever, time passes...

3. glXSwapBuffersMscOML() or glXWaitForMscOML() is used to schedule
something to occur at a target vblank count, based on the results of
1.

As long as vblanks don't get disabled between 1 and 3, everything is
consistent. Scheduling delays/preemption of more than a few (dozen)
milliseconds worst case are very rare, so a lowered vblank off delay
of even one or two refresh cycles would solve the problem. I assume
that most toolkits with needs for timing precision would follow a
three step strategy like this one.
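The target-count arithmetic behind step 3 can be sketched as plain C; a toolkit would take base_msc/base_ust from glXGetSyncValuesOML() in step 1 and pass the result to glXSwapBuffersMscOML() or glXWaitForMscOML() in step 3 (the function name and all values below are illustrative, not a real toolkit's API):

```c
#include <assert.h>
#include <stdint.h>

/* Compute the target vblank count for a desired onset time.
 * base_msc/base_ust: vblank count and timestamp (microseconds) from
 * step 1; onset_ust: when the stimulus should appear on screen;
 * refresh_us: duration of one refresh interval in microseconds. */
static int64_t target_msc(int64_t base_msc, int64_t base_ust,
                          int64_t onset_ust, int64_t refresh_us)
{
    if (onset_ust <= base_ust)
        return base_msc + 1; /* already late; next vblank is the best we can do */

    /* Round up: presentation can only happen at a vblank at or after
     * the requested onset time. */
    return base_msc + (onset_ust - base_ust + refresh_us - 1) / refresh_us;
}
```

This also makes the failure mode concrete: if the base_msc reported in step 1 is off by one because of the enable/disable race, every target computed from it is off by one, and the swap or stimulus lands a full frame early or late.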

If vblank irq's get disabled immediately after step 1 and re-enabled at step 3, and this triggers an off-by-one error due to a race, then the calculated target vblank count from steps 1/2 is invalid and useless for step 3, and many vision scientists who have only just made friends with Linux will make very sad faces. Due to the nature of many of the stimulation paradigms used, there is also a high chance that many of the enables and disables would happen inside or very close to the problematic vblank interval where the off-by-one error is most likely.

Ok, so as long as we believe that we're reliably reading the hardware
counter and not getting off by one, we should be ok? I just want to be
clear that this is dependent on hardware behaviour rather than being
absolutely inherent :)


Yes. The way we find the final/new software/hardware vblank count at irq off/on time must be consistent with what would have happened if the counters had been incremented by "just another vblank irq".

But apart from this, I would assume that even if we can remove all races, it would probably still make sense to always have some small vblank off delay, even if it is only half a video refresh cycle.

During execution of a bufferswap, or a three-step procedure like my toolkit's, or a desktop composition pass (at least the way compiz does it; I didn't check other compositors), we will usually have multiple drm_vblank_get() -> drm_vblank_put() invocations in quick succession, triggered by the ddx inside the x-server and the drm code during swap preparation and swap completion. Without a delay of at least a few milliseconds, the code would turn the irq's on and off multiple times within a few milliseconds, each time involving locking, transactions with the gpu, some memory barriers, etc.

As long as the desktop is busy with some OpenGL animations, vblank irq's will necessarily fire for each refresh cycle regardless of the timeout. And with a small timeout of a few milliseconds, when the desktop finally goes idle you'd probably end up with at most one extra vblank irq, but save a little bit of cpu time otherwise spent on multiple on/off transitions per frame during the busy period.
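The effect of that small off-delay can be shown with a toy model of the get/put refcount and a deferred disable, driven by an explicit millisecond clock instead of a kernel timer (struct and function names are invented for the sketch; the real code lives around drm_vblank_get()/drm_vblank_put() and the disable timer in the drm core):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of vblank irq refcounting with a deferred disable. */
struct vbl_state {
    int refcount;         /* outstanding "drm_vblank_get" calls */
    int irq_on;
    int enable_count;     /* how many real enable transitions happened */
    int64_t off_deadline; /* when a pending deferred disable fires, ms */
    int64_t off_delay;    /* configured vblank off delay, ms */
};

static void vbl_get(struct vbl_state *s)
{
    if (s->refcount++ == 0 && !s->irq_on) {
        s->irq_on = 1;
        s->enable_count++; /* the costly part: locking, gpu access, barriers */
    }
    s->off_deadline = INT64_MAX; /* cancel any pending deferred disable */
}

static void vbl_put(struct vbl_state *s, int64_t now_ms)
{
    if (--s->refcount == 0)
        s->off_deadline = now_ms + s->off_delay; /* arm deferred disable */
}

/* Stand-in for the timer callback: disable the irq only if the
 * deadline passed with no new user showing up. */
static void vbl_timer(struct vbl_state *s, int64_t now_ms)
{
    if (s->irq_on && s->refcount == 0 && now_ms >= s->off_deadline)
        s->irq_on = 0;
}
```

With three get/put pairs 2 ms apart (one frame's worth of swap traffic), a zero delay produces three full enable transitions, while a 5 ms delay coalesces them into one, at the price of at most one extra vblank irq after the last put.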

Out of pure interest, how much power does vblank off actually save, compared to waking up the cpu every 16 msecs?

I agree with your patches, I just don't want vblank irq's to be
disabled immediately with a zero timeout before the remaining races
are fixed; that would surely break things at least for scientific
applications. And I'd like to keep some optional parameter that
allows desperate users to disable the vblank off mechanism in case
they encounter a bug.

Yeah, no problem. I understand that completely.


Great, thanks.
-mario

--
Matthew Garrett | mjg59@xxxxxxxxxxxxx

*********************************************************************
Mario Kleiner
Max Planck Institute for Biological Cybernetics
Spemannstr. 38
72076 Tuebingen
Germany

e-mail: mario.kleiner@xxxxxxxxxxxxxxxx
office: +49 (0)7071/601-1623
fax:    +49 (0)7071/601-616
www:    http://www.kyb.tuebingen.mpg.de/~kleinerm
*********************************************************************
"For a successful technology, reality must take precedence
over public relations, for Nature cannot be fooled."
(Richard Feynman)

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel

