Re: v3.4-rc4 DSS PM problem (Was: Re: Problems with 3.4-rc5)

Paul Walmsley <paul@xxxxxxxxx> · Thu, 24 May 2012 18:39:02 -0600 (MDT)

cc Jean

Hello Tomi,

On Wed, 16 May 2012, Tomi Valkeinen wrote:

> I also suspect that this could be just a plain DSS bug. The default FIFO
> low/high thresholds are 960/1023 bytes (i.e. DSS starts refilling the
> FIFO when there are 960 or less bytes in the fifo, and stops at 1023.
> The fifo is 1024 bytes). The values are calculated with fifo_size -
> burst_size and fifo_size - 1.
> 
> We are now using FIFO merge features, which combines multiple fifos into
> one when possible, making the fifo size 1024*3 = 3072. Using the same
> low threshold and increasing the high threshold to 960/3071 works fine.
> Changing the high threshold to 3008 causes underflows. Increasing the
> low threshold to ~1600 makes DSS work again.

Just a few thoughts.

In terms of the high threshold, it seems really strange to me that 
changing the high threshold would make such a difference.  Naïvely, I'd 
assume that you'd want to set it as high as possible?  I suppose in cases 
where the interconnect is congested, setting it lower might allow lower 
latency for other interconnect users, but I'd hope we don't have to worry 
much about that.  So it doesn't seem to me that there would be any 
advantage to setting it lower than the maximum.

Probably the low threshold is the more important parameter, from a PM 
perspective.  If you know the FIFO's drain rate and the low threshold, it 
should be possible to calculate the maximum latency that the FIFO can 
tolerate to avoid an underflow.  This could be used to specify a device PM 
QoS constraint to prevent the interconnect latency from exceeding that 
value.

I'd guess the calculations would be something like this -- (I hope you can 
correct my relative ignorance of the DSS in the following estimates):

Looking at mach-omap2/board-rx51-video.c, let's suppose that the FIFO 
drain rate would be 864 x 480 x 32 bits/second.  Since the FIFO width is 
32 bits, that's

   864 x 480 = 414 720 FIFO entries/second, or

   (1 000 000 µs/s / 414 720 FIFO entries/s) = ~2.411 µs/FIFO entry.

So if you need a low FIFO threshold at 960 entries, you could call the 
device PM QoS functions to set a wakeup latency constraint for the 
interconnect.  That constraint should be no greater than this:

   (2.411 µs/FIFO entry * 960 FIFO entries) = 2 314.56 µs

(The reality is that it would need to be something less than this, to 
account for the time needed for the GFX DMA transfer to start supplying 
data, etc.)
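
As a rough sketch of the arithmetic above (Python, purely illustrative; 
the function names are made up here and are not part of any kernel API, 
and the threshold is treated as a count of 32-bit FIFO entries, as in my 
estimate).  Carrying the unrounded per-entry time, the budget comes out 
near 2 315 µs:

```python
# Illustrative only: these helper names are hypothetical, not kernel APIs.

US_PER_SECOND = 1_000_000


def us_per_fifo_entry(entries_per_second: int) -> float:
    """Time for the FIFO to drain one entry, in microseconds."""
    return US_PER_SECOND / entries_per_second


def max_tolerable_latency_us(low_threshold_entries: int,
                             entries_per_second: int) -> float:
    """Upper bound on refill latency before the FIFO underflows.

    Assumes a constant drain rate and that the refill is requested
    exactly when the fill level reaches the low threshold.
    """
    return low_threshold_entries * us_per_fifo_entry(entries_per_second)


# Drain rate assumed in the estimate above: 864 x 480 entries/second.
drain_rate = 864 * 480
print(f"{us_per_fifo_entry(drain_rate):.3f} us/entry")
print(f"{max_tolerable_latency_us(960, drain_rate):.2f} us latency budget")
```

Whatever value ends up in a PM QoS request would need to be somewhat 
below this figure, for the reasons given above.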

The ultimate goal, with Jean's device PM QoS patches, is that these 
constraints could change the DPLL autoidle settings or powerdomain states 
to ensure the constraint was met.  He's got a page here:

  http://omappedia.org/wiki/Power_Management_Device_Latencies_Measurement

(Unfortunately it's not clear what the DPLL autoidle modes and voltage 
scaling bits are set to for many of the estimates, and we also know that 
there are many software optimizations possible for our idle path.)

We're still working on getting the OMAP device PM QoS patches merged, but 
the Linux core support is there, so you should be able to patch your 
drivers to use them -- see for example dev_pm_qos_add_request().

...

Similarly, for the low-power refresh case, if you know the GFX FIFO drain 
rate and the various latencies, it should be possible to estimate the 
minimum low threshold value needed in order to avoid a FIFO underflow.

(By "various latencies," I mean the DPLL relock latency, the GFX DMA 
latency between initiating a transfer and receiving the first result data, 
etc.  Some of these latencies may be difficult to estimate accurately.  
But if the major sources of variation can be identified, such as DPLL 
relock time or GFX DMA FIFO refill time, I'd hope we can just use trial 
and error to find some worst-case constant for the rest.)

The goal in this case would be to allow DPLL3 to stay unlocked for as long 
as possible, to save energy.  This would imply finding the lowest possible 
FIFO low threshold that doesn't generate underflows.  Using the lowest 
possible low threshold should leave as much room as possible in the FIFO 
for data, and thus maximize the amount of time that DPLL3 can stay 
unlocked after the high threshold is reached.
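
To illustrate why a lower low threshold helps, here is a minimal sketch 
(Python, with a hypothetical helper name) of how long the FIFO takes to 
drain from the high threshold down to the low threshold.  It assumes a 
constant drain rate of 864 x 480 entries/second, no refill traffic in 
between, and thresholds counted in FIFO entries; the 3071, 960 and 1600 
values are borrowed from the numbers quoted earlier in this thread:

```python
# Illustrative only: drain_window_us() is a hypothetical helper.

def drain_window_us(high_entries: int, low_entries: int,
                    entries_per_second: int) -> float:
    """Microseconds for the FIFO to drain from high to low threshold."""
    if high_entries < low_entries:
        raise ValueError("high threshold must not be below low threshold")
    return (high_entries - low_entries) * 1_000_000 / entries_per_second


drain_rate = 864 * 480  # assumed FIFO entries per second

# Same high threshold, two candidate low thresholds.
for low in (960, 1600):
    window = drain_window_us(3071, low, drain_rate)
    print(f"low={low}: about {window:.0f} us between refills")
```

Under those assumptions the window shrinks from roughly 5.1 ms to roughly 
3.5 ms when the low threshold goes from 960 to 1600.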

Since the DPLL relock latency figures are known from the TRM section 
4.7.6.7 "Latencies," we can estimate the DPLL's contribution to the low 
threshold setting.  The DPLL relock latency depends on the DPLL's input 
rate and some DPLL settings, so it can vary.  (We probably need a 
function for the interconnect device that can estimate the worst-case 
wakeup latency for the DSS to use, based on the rest of the system 
settings.)

Let's reuse the 2.411 µs/FIFO entry estimate from above.  For convenience, 
let's suppose that the DPLL relock latency from DPLL-OFF is 1.5 ms = 1500 
µs.  So we know that the number of FIFO slots needed simply to endure the 
DPLL relock process is

   CEIL(1500 µs/relock / 2.411 µs/FIFO entry) = CEIL(622.14 ...) = 
       623 FIFO entries/relock

This of course doesn't account for the time needed for the GFX DMA 
transfer to start delivering useful data, any voltage scaling needed, etc.
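
The same estimate as a small sketch (Python; the function name and the 
extra_us parameter are hypothetical).  extra_us stands in for the other 
latencies mentioned above, such as the GFX DMA startup time, which would 
have to be found by measurement or trial and error:

```python
# Illustrative only: min_low_threshold_entries() is a hypothetical helper.

def min_low_threshold_entries(relock_us: int, entries_per_second: int,
                              extra_us: int = 0) -> int:
    """FIFO entries drained while waiting relock_us + extra_us.

    Uses integer ceiling division so a partially drained entry counts
    as a whole one.  Assumes a constant drain rate.
    """
    total = (relock_us + extra_us) * entries_per_second
    return -(-total // 1_000_000)


drain_rate = 864 * 480  # assumed FIFO entries per second

# 1.5 ms DPLL relock latency, as supposed above.
print(min_low_threshold_entries(1500, drain_rate))

# Same, plus a made-up 100 us allowance for everything else.
print(min_low_threshold_entries(1500, drain_rate, extra_us=100))
```

With the 1500 µs relock figure alone this gives 623 entries, matching the 
hand calculation.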

...

Just paging through the DSS TRM section, some other settings that might be 
worth checking are:

- is DISPC_GFX_ATTRIBUTES.GFXBURSTSIZE set to 16x32?

- is DISPC_GFX_ATTRIBUTES.GFXFIFOPRELOAD set to 1?

- is DISPC_GFX_PRELOAD.PRELOAD set to the maximum possible value?

- is DISPC_CONFIG.FIFOFILLING set to 1?

> So I think that the high thresholds of 3071 and 3008 are so close to
> each other that there shouldn't be any real difference in practice,
> presuming everything works. But, for whatever reason, fetching of the
> pixels becomes much more inefficient or with much higher start latency,
> causing the underflows.

That's really weird.

- Paul