Hi Grazvydas, Kevin, I did some gather some performance measurements and statistics using custom tracepoints in __omap3_enter_idle. All the details are at http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis . The setup is: - Beagleboard (OMAP3530) at 500MHz, - l-o master kernel + functional power states + per-device PM QoS. It has been checked that the changes from l-o master do not have an impact on the performance. - The data transfer is performed using dd from a file in JFFS2 to /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'. On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@xxxxxx> wrote: > Grazvydas Ignotas <notasas@xxxxxxxxx> writes: > >> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@xxxxxx> wrote: >>> It would be helpful now to narrow down what are the big contributors to >>> the overhead in omap_sram_idle(). Most of the code there is skipped for >>> C1 because the next states for MPU and CORE are both ON. >> >> Ok I did some tests, all in mostly idle system with just init, busybox >> shell and dd doing a NAND read to /dev/null . > ... > >> MB/s is throughput that >> dd reports, mA and approx. current draw during the transfer, read from >> fuel gauge that's onboard. >> >> MB/s| mA|comment >> 3.7|218|mainline f549e088b80 >> 3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1] >> 4.4|220|[1] + pwrdm_p*_transition commented [2] >> 3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3] >> 4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4] >> 4.0|224|[1] + 'Deny idle' [5] >> 5.1|210|[2] + [4] + [5] >> 5.2|202|[5] + omap_sram_idle->cpu_do_idle [6] >> 5.5|243|!CONFIG_PM >> 6.1|282|busywait DMA end (for reference) Here are the results (BW in MB/s) on Beagleboard: - 4.7: without using DMA, - Using DMA 2.1: [0] 2.1: [1] only C1 2.6: [1]+[2] no pre_ post_ 2.3: [1]+[5] no pwrdm_for_each_clkdm 2.8: [1]+[5]+[2] 3.1: [1]+[5]+[6] no omap_sram_idle 3.1: No IDLE, no omap_sram_idle, all pwrdms to ON So indeed this shows there is some serious performance issue with the C1 C-state. > Thanks for the detailed experiments. This definitely confirms we have > some serious unwanted overhead for C1, and our C-state latency values > are clearly way off base, since they only account HW latency and not any > of the SW latency introduced in omap_sram_idle(). > >>> There are 2 primary differences that I see as possible causes. I list >>> them here with a couple more experiments for you to try to help us >>> narrow this down. >>> >>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition() >>> >>> Could you try using omap_sram_idle() and just commenting out those >>> calls? Does that help performance? Those iterate over all the >>> powerdomains, so defintely add some overhead, but I don't think it >>> would be as significant as what you're seeing. >> >> Seems to be taking good part of it. >> >>> Much more likely is... >>> >>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds >> >> Could not notice any difference. >> >> To me it looks like this results from many small things adding up.. >> Idle is called so often that pwrdm_p*_transition() and those >> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps >> because they access lots of registers on slow buses? >From the list of contributors, the main ones are: (140us) pwrdm_pre_transition and pwrdm_post_transition, (105us) omap2_gpio_prepare_for_idle and omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in the latency-critical C-states, (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle), (33us estimated) omap_set_pwrdm_state(mpu, core, neon), (11 us) clkdm_allow_idle(mpu). Is this needed? Here are a few questions and suggestions: - In case of latency critical C-states could the high-latency code be bypassed in favor of a much simpler version? Pushing the concept a bit farther one could have a C1 state that just relaxes the cpu (no WFI), a C2 state which bypasses a lot of code in __omap3_enter_idle, and the rest of the C-states as we have today, - Is it needed to iterate through all the power and clock domains in order to keep them active? - Trying to idle some non related power domains (e.g. PER) causes a performance hit. How to link all the power domains states to the cpuidle C-state? The per-device PM QoS framework could be used to constraint some power domains, but this is highly dependent on the use case. > Yes PRCM register accesses are unfortunately rather slow, and we've > known that for some time, but haven't done any detailed analysis of the > overhead. That would be worth doing the analysis. A lot of read accesses to the current, next and previous power states are performed in the idle code. > Using the function_graph tracer, I was able to see that the pre/post > transition are taking an enormous amount of time: > > - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz) > - pwrdm post-transtion: 1600+ us at 600MHz (6000+ us at 125MHz) > > Notice the big difference between 600MHz OPP and 125MHz OPP. Are you > using CPUfreq at all in your tests? If using cpufreq + ondemand > governor, you're probably running at low OPP due to lack of CPU activity > which will also affect the latencies in the idle path. > >> Maybe some register cache would help us there, or are those registers >> expected to be changed by hardware often? > > Yes, we've known that some sort of register cache here would be useful > for some time, but haven't got to implementing it. I can try some proof of concept code, just to prove its usefulness. >> Also trying to idle PER while transfer is ongoing (as reported in >> previous mail) doesn't sound like a good idea and is one of the >> reasons for slowdown. Seems to also causing more current drain, >> ironically. > > Agreed. Again, using the function_graph tracer, I get some pretty big > latencies from the GPIO pre/post idling process: > > - gpio_prepare_for_idle(): 2400+ us at 600MHz (8200+ us at 125MHz) > - gpio_resume_from_idle(): 2200+ us at 600MHz (7600+ us at 125MHz) > > Removing PER transtions as you did will get rid of those. > > I'm looking into this in more detail know, and will likely have a few > patches for you to experiment with. > > Thanks again for digging into this with us, > > Kevin > Any thoughts? Regards, Jean >> >> >> changes (again, sorry for corrupted diffs, but they should be easy to >> reproduce): >> [2]: >> --- a/arch/arm/mach-omap2/pm34xx.c >> +++ b/arch/arm/mach-omap2/pm34xx.c >> @@ -307,7 +307,7 @@ void omap_sram_idle(void) >> omap3_enable_io_chain(); >> } >> >> - pwrdm_pre_transition(); >> +// pwrdm_pre_transition(); >> >> /* PER */ >> if (per_next_state < PWRDM_POWER_ON) { >> @@ -372,7 +373,7 @@ void omap_sram_idle(void) >> } >> omap3_intc_resume_idle(); >> >> - pwrdm_post_transition(); >> +// pwrdm_post_transition(); >> >> /* PER */ >> if (per_next_state < PWRDM_POWER_ON) { >> [3]: >> --- a/arch/arm/mach-omap2/pm34xx.c >> +++ b/arch/arm/mach-omap2/pm34xx.c >> @@ -347,7 +347,7 @@ void omap_sram_idle(void) >> if (save_state == 1 || save_state == 3) >> cpu_suspend(save_state, omap34xx_do_sram_idle); >> else >> - omap34xx_do_sram_idle(save_state); >> + cpu_do_idle(); >> >> /* Restore normal SDRC POWER settings */ >> if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 && >> [4]: >> --- a/arch/arm/mach-omap2/cpuidle34xx.c >> +++ b/arch/arm/mach-omap2/cpuidle34xx.c >> @@ -107,6 +107,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev, >> if (index == 0) { >> pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle); >> pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle); >> + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON); >> } >> >> /* >> [5]: >> --- a/arch/arm/mach-omap2/cpuidle34xx.c >> +++ b/arch/arm/mach-omap2/cpuidle34xx.c >> @@ -105,8 +105,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev, >> >> /* Deny idle for C1 */ >> if (index == 0) { >> - pwrdm_for_each_clkdm(mpu_pd, _cpuidle_deny_idle); >> - pwrdm_for_each_clkdm(core_pd, _cpuidle_deny_idle); >> + clkdm_deny_idle(mpu_pd->pwrdm_clkdms[0]); >> } >> >> /* >> @@ -128,8 +128,7 @@ static int __omap3_enter_idle(struct cpuidle_device *dev, >> >> /* Re-allow idle for C1 */ >> if (index == 0) { >> - pwrdm_for_each_clkdm(mpu_pd, _cpuidle_allow_idle); >> - pwrdm_for_each_clkdm(core_pd, _cpuidle_allow_idle); >> + clkdm_allow_idle(mpu_pd->pwrdm_clkdms[0]); >> } >> >> return_sleep_time: >> [6]: >> --- a/arch/arm/mach-omap2/cpuidle34xx.c >> +++ b/arch/arm/mach-omap2/cpuidle34xx.c >> @@ -117,7 +116,8 @@ static int __omap3_enter_idle(struct cpuidle_device *dev, >> cpu_pm_enter(); >> >> /* Execute ARM wfi */ >> - omap_sram_idle(); >> + //omap_sram_idle(); >> + cpu_do_idle(); >> >> /* >> * Call idle CPU PM enter notifier chain to restore > -- > To unsubscribe from this list: send the line "unsubscribe linux-omap" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html