Hi Tero,

On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo <t-kristo@xxxxxx> wrote:
> On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
>> + Tero
>>
>> On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
>> > Hi Grazvydas, Kevin,
>> >
>> > I gathered some performance measurements and statistics using
>> > custom tracepoints in __omap3_enter_idle.
>> > All the details are at
>> > http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>> >
>> Nice data.
>>
>> > The setup is:
>> > - Beagleboard (OMAP3530) at 500MHz,
>> > - l-o master kernel + functional power states + per-device PM QoS. It
>> >   has been checked that the changes from l-o master do not have an
>> >   impact on the performance.
>> > - The data transfer is performed using dd from a file in JFFS2 to
>> >   /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
>
> Question: what is used for gathering the latency values?

I used ftrace tracepoints, which are supposed to be low overhead. I
checked that the overhead cannot be measured on the measurement
interval (>400us), given that the time base is 31us (32 kHz clock). A
rough sketch of this kind of instrumentation follows the quoted
discussion below.

>> >
>> > On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@xxxxxx> wrote:
>> >> Grazvydas Ignotas <notasas@xxxxxxxxx> writes:
>> >>
>> >>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@xxxxxx> wrote:
>> >>>> It would be helpful now to narrow down what are the big contributors to
>> >>>> the overhead in omap_sram_idle(). Most of the code there is skipped for
>> >>>> C1 because the next states for MPU and CORE are both ON.
>> >>>
>> >>> Ok I did some tests, all in a mostly idle system with just init, busybox
>> >>> shell and dd doing a NAND read to /dev/null.
>> >>
>> > ...
>> >>
>> >>> MB/s is the throughput that dd reports, mA is the approximate
>> >>> current draw during the transfer, read from the onboard fuel gauge.
>> >>>
>> >>> MB/s| mA|comment
>> >>>  3.7|218|mainline f549e088b80
>> >>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>> >>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>> >>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>> >>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>> >>>  4.0|224|[1] + 'Deny idle' [5]
>> >>>  5.1|210|[2] + [4] + [5]
>> >>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>> >>>  5.5|243|!CONFIG_PM
>> >>>  6.1|282|busywait DMA end (for reference)
>> >
>> > Here are the results (BW in MB/s) on Beagleboard:
>> > - 4.7: without using DMA,
>> >
>> > - Using DMA:
>> >   2.1: [0]
>> >   2.1: [1] only C1
>> >   2.6: [1]+[2] no pre_ post_
>> >   2.3: [1]+[5] no pwrdm_for_each_clkdm
>> >   2.8: [1]+[5]+[2]
>> >   3.1: [1]+[5]+[6] no omap_sram_idle
>> >   3.1: no IDLE, no omap_sram_idle, all pwrdms to ON
>> >
>> > So indeed this shows there is some serious performance issue with the
>> > C1 C-state.
>> >
>> Looks like the other clockdomains (notably l4, per, AON) should be denied
>> idle in C1 to avoid the huge penalties. It might just do the trick.
>>
>> >> Thanks for the detailed experiments. This definitely confirms we have
>> >> some serious unwanted overhead for C1, and our C-state latency values
>> >> are clearly way off base, since they only account for HW latency and
>> >> not any of the SW latency introduced in omap_sram_idle().
>> >>
>> >>>> There are 2 primary differences that I see as possible causes. I list
>> >>>> them here with a couple more experiments for you to try to help us
>> >>>> narrow this down.
>> >>>>
>> >>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>> >>>>
>> >>>> Could you try using omap_sram_idle() and just commenting out those
>> >>>> calls? Does that help performance? Those iterate over all the
>> >>>> powerdomains, so they definitely add some overhead, but I don't think it
>> >>>> would be as significant as what you're seeing.
>> >>>
>> >>> Seems to be taking a good part of it.
>> >>>
>> >>>> Much more likely is...
>> >>>>
>> >>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>> >>>
>> >>> Could not notice any difference.
>> >>>
>> >>> To me it looks like this results from many small things adding up.
>> >>> Idle is called so often that pwrdm_p*_transition() and those
>> >>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>> >>> because they access lots of registers on slow buses?
>> >
>> > From the list of contributors, the main ones are:
>> > (140us) pwrdm_pre_transition and pwrdm_post_transition,
>>
>> I have observed this one on OMAP4 too. There was a plan to remove
>> this as part of Tero's PD/CD use-counting series.
>
> pwrdm_pre/post transitions could be optimized a bit already now. They
> should only need to be called for the mpu, core and per domains, but
> currently they scan through everything.
>
>> > (105us) omap2_gpio_prepare_for_idle and
>> > omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
>> > the latency-critical C-states,
>>
>> Yes. In C1, when you deny idle for per, there should be no need to
>> call this. But even in the case when it is called, why is it taking
>> 105us? That needs further digging.
>>
>> > (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
>>
>> Depending on the OPP, a PRCM read can take up to ~12-14us, so the
>> above shouldn't be surprising.
>>
>> > (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
>>
>> This is again dominated by PRCM reads.
>>
>> > (11us) clkdm_allow_idle(mpu). Is this needed?
>> >
>> I guess yes, otherwise when C2+ is attempted the MPU CD can't idle.
>>
>> > Here are a few questions and suggestions:
>> > - In case of latency-critical C-states, could the high-latency code be
>> > bypassed in favor of a much simpler version? Pushing the concept a bit
>> > farther, one could have a C1 state that just relaxes the cpu (no WFI),
>> > a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
>> > rest of the C-states as we have today,
>>
>> We should do that. In fact the C1 state should be as light as
>> possible, like a plain WFI or so.
>>
>> > - Is it needed to iterate through all the power and clock domains in
>> > order to keep them active?
>>
>> That iteration should be removed.
>>
>> > - Trying to idle some unrelated power domains (e.g. PER) causes a
>> > performance hit. How to link all the power domain states to the
>> > cpuidle C-state? The per-device PM QoS framework could be used to
>> > constrain some power domains, but this is highly dependent on the use
>> > case.
>> >
>> Note that just limiting the PER PD state to ON is not going to
>> remove the penalty. You need to avoid the PER CD transition and
>> hence deny idle. I remember the Nokia team did this on some
>> products.
>
> The N9 kernel (which is available at
> http://harmattan-dev.nokia.com/pool/harmattan/free/k/kernel/) contained
> a lot of optimizations in the idle path. Maybe someone should take a
> look at it at some point.

Ok, thanks for the link.
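As promised above, here is the kind of instrumentation I used, as a
minimal sketch. One assumption up front: the actual measurements used
custom tracepoints, which are not shown in this thread; trace_printk()
is used here only as the simplest low-overhead stand-in.

/*
 * Instrumentation sketch only (untested). Each trace_printk() call
 * lands a timestamped entry in the ftrace ring buffer, so the
 * per-section latencies can be read as deltas in
 * /sys/kernel/debug/tracing/trace.
 */
#include <linux/kernel.h>	/* trace_printk() */
#include "powerdomain.h"	/* pwrdm_pre_transition() */

static void omap_sram_idle_traced(void)
{
	trace_printk("enter omap_sram_idle\n");

	pwrdm_pre_transition();
	trace_printk("pwrdm_pre_transition done\n");

	/*
	 * ... instrument the remaining omap_sram_idle() sections the
	 * same way, down to pwrdm_post_transition() ...
	 */

	trace_printk("exit omap_sram_idle\n");
}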
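Going back to the "lite C1" and deny-idle suggestions above, here is a
minimal, untested sketch of what that could look like in
cpuidle34xx.c. The dedicated C1 handler (instead of a check inside
__omap3_enter_idle) and the one-time per_clkdm lookup are my
assumptions:

#include <linux/cpuidle.h>
#include <asm/proc-fns.h>	/* cpu_do_idle() */
#include "clockdomain.h"	/* clkdm_{lookup,deny_idle,allow_idle}() */

/* Looked up once at init: per_clkdm = clkdm_lookup("per_clkdm"); */
static struct clockdomain *per_clkdm;

static int omap3_enter_idle_lite_c1(struct cpuidle_device *dev,
				    struct cpuidle_driver *drv,
				    int index)
{
	/*
	 * Keep PER out of the transition entirely: no
	 * omap2_gpio_prepare_for_idle(), no PER context handling,
	 * no PER power-state reads.
	 */
	clkdm_deny_idle(per_clkdm);

	/*
	 * Plain WFI instead of omap_sram_idle(): skips the
	 * pwrdm_pre/post_transition() scans over every powerdomain,
	 * the pwrdm_for_each_clkdm() walks and the SRAM jump.
	 */
	cpu_do_idle();

	clkdm_allow_idle(per_clkdm);

	return index;
}

This trades PER ever reaching RET from C1 for the software overhead
measured above; it should behave roughly like the [2]+[4]+[5]
combination from Grazvydas' table.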
>
>> >> Yes, PRCM register accesses are unfortunately rather slow, and we've
>> >> known that for some time, but haven't done any detailed analysis of the
>> >> overhead.
>> >
>> > That analysis would be worth doing. A lot of read accesses to the
>> > current, next and previous power states are performed in the idle
>> > code.
>> >
>> >> Using the function_graph tracer, I was able to see that the pre/post
>> >> transitions are taking an enormous amount of time:
>> >>
>> >> - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>> >> - pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)
>> >>
>> >> Notice the big difference between the 600MHz and 125MHz OPPs. Are you
>> >> using CPUfreq at all in your tests? If using cpufreq + the ondemand
>> >> governor, you're probably running at a low OPP due to lack of CPU
>> >> activity, which will also affect the latencies in the idle path.
>> >>
>> >>> Maybe some register cache would help us there, or are those registers
>> >>> expected to be changed by hardware often?
>> >>
>> >> Yes, we've known that some sort of register cache here would be useful
>> >> for some time, but haven't got to implementing it.
>> >
>> > I can try some proof-of-concept code, just to prove its usefulness.
>> >
>> Please do so. We were hoping that after Tero's series we won't need
>> this pre/post stuff, but I am not sure if Tero is addressing that.
>>
>> The register cache initiative is most welcome.
>>
>> Regards,
>> Santosh

Regards,
Jean
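P.S. As a very first cut at the register cache proof of concept,
something like the sketch below is what I have in mind. Everything
here is hypothetical (names, sizing, the direct-mapped layout, and no
locking is shown), and any register the hardware updates on its own,
e.g. the previous-power-state registers, must be invalidated before
being read again after wakeup:

/*
 * Write-through, direct-mapped read cache for PRCM registers
 * (proof-of-concept sketch only, untested).
 */
#include <linux/io.h>
#include <linux/types.h>

#define PRCM_CACHE_SIZE 64

struct prcm_cache_entry {
	void __iomem *addr;
	u32 val;
	bool valid;
};

static struct prcm_cache_entry prcm_cache[PRCM_CACHE_SIZE];

static inline struct prcm_cache_entry *prcm_cache_slot(void __iomem *addr)
{
	return &prcm_cache[((unsigned long)addr >> 2) % PRCM_CACHE_SIZE];
}

static u32 prcm_read_cached(void __iomem *addr)
{
	struct prcm_cache_entry *e = prcm_cache_slot(addr);

	if (e->valid && e->addr == addr)
		return e->val;		/* hit: no slow interconnect access */

	e->addr = addr;
	e->val = __raw_readl(addr);	/* miss: one real PRCM read */
	e->valid = true;
	return e->val;
}

static void prcm_write_cached(u32 val, void __iomem *addr)
{
	struct prcm_cache_entry *e = prcm_cache_slot(addr);

	__raw_writel(val, addr);	/* write-through: HW stays authoritative */
	e->addr = addr;
	e->val = val;
	e->valid = true;
}

/* For registers the hardware may have changed behind our back. */
static void prcm_cache_invalidate(void __iomem *addr)
{
	struct prcm_cache_entry *e = prcm_cache_slot(addr);

	if (e->addr == addr)
		e->valid = false;
}

If, as the traces suggest, most idle-path accesses are reads of
software-written registers (next power state, clockdomain control),
even this naive version should save most of the ~12-14us-per-access
PRCM reads mentioned above.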