Hi Tero,

On Tue, Apr 24, 2012 at 2:21 PM, Tero Kristo <t-kristo@xxxxxx> wrote:
> On Tue, 2012-04-24 at 16:08 +0530, Santosh Shilimkar wrote:
>> + Tero
>>
>> On Tuesday 24 April 2012 03:20 PM, Jean Pihet wrote:
>> > Hi Grazvydas, Kevin,
>> >
>> > I gathered some performance measurements and statistics using
>> > custom tracepoints in __omap3_enter_idle.
>> > All the details are at
>> > http://www.omappedia.org/wiki/Power_Management_Device_Latencies_Measurement#C1_performance_problem:_analysis
>> >
>> Nice data.
>>
>> > The setup is:
>> > - Beagleboard (OMAP3530) at 500MHz,
>> > - l-o master kernel + functional power states + per-device PM QoS. It
>> >   has been checked that the changes from l-o master do not have an
>> >   impact on the performance.
>> > - The data transfer is performed using dd from a file in JFFS2 to
>> >   /dev/null: 'dd if=/tmp/mnt/a of=/dev/null bs=1M count=32'.
>
> Question: what is used for gathering the latency values?

I used ftrace tracepoints, which are supposed to be low overhead. I
checked that the overhead cannot be measured on the measurement
interval (>400us), given that the time base is 31us (32 kHz clock). A
rough sketch of this kind of instrumentation follows the quoted
discussion below.

>> >
>> > On Tue, Apr 17, 2012 at 4:30 PM, Kevin Hilman <khilman@xxxxxx> wrote:
>> >> Grazvydas Ignotas <notasas@xxxxxxxxx> writes:
>> >>
>> >>> On Thu, Apr 12, 2012 at 3:19 AM, Kevin Hilman <khilman@xxxxxx> wrote:
>> >>>> It would be helpful now to narrow down what are the big contributors to
>> >>>> the overhead in omap_sram_idle(). Most of the code there is skipped for
>> >>>> C1 because the next states for MPU and CORE are both ON.
>> >>>
>> >>> Ok I did some tests, all in a mostly idle system with just init, busybox
>> >>> shell and dd doing a NAND read to /dev/null.
>> >>
>> > ...
>> >>
>> >>> MB/s is the throughput that dd reports, mA is the approximate
>> >>> current draw during the transfer, read from the onboard fuel gauge.
>> >>>
>> >>> MB/s| mA|comment
>> >>>  3.7|218|mainline f549e088b80
>> >>>  3.8|224|nand qos PM_QOS_CPU_DMA_LATENCY 0 [1]
>> >>>  4.4|220|[1] + pwrdm_p*_transition commented [2]
>> >>>  3.8|225|[1] + omap34xx_do_sram_idle->cpu_do_idle [3]
>> >>>  4.2|210|[1] + pwrdm_set_next_pwrst(per_pd, PWRDM_POWER_ON) [4]
>> >>>  4.0|224|[1] + 'Deny idle' [5]
>> >>>  5.1|210|[2] + [4] + [5]
>> >>>  5.2|202|[5] + omap_sram_idle->cpu_do_idle [6]
>> >>>  5.5|243|!CONFIG_PM
>> >>>  6.1|282|busywait DMA end (for reference)
>> >
>> > Here are the results (BW in MB/s) on Beagleboard:
>> > - 4.7: without using DMA,
>> >
>> > - Using DMA:
>> >   2.1: [0]
>> >   2.1: [1] only C1
>> >   2.6: [1]+[2] no pre_ post_
>> >   2.3: [1]+[5] no pwrdm_for_each_clkdm
>> >   2.8: [1]+[5]+[2]
>> >   3.1: [1]+[5]+[6] no omap_sram_idle
>> >   3.1: no IDLE, no omap_sram_idle, all pwrdms to ON
>> >
>> > So indeed this shows there is some serious performance issue with the
>> > C1 C-state.
>> >
>> Looks like the other clockdomains (notably l4, per, AON) should be denied
>> idle in C1 to avoid the huge penalties. It might just do the trick.
>>
>> >> Thanks for the detailed experiments. This definitely confirms we have
>> >> some serious unwanted overhead for C1, and our C-state latency values
>> >> are clearly way off base, since they only account for HW latency and
>> >> not any of the SW latency introduced in omap_sram_idle().
>> >>
>> >>>> There are 2 primary differences that I see as possible causes. I list
>> >>>> them here with a couple more experiments for you to try to help us
>> >>>> narrow this down.
>> >>>>
>> >>>> 1) powerdomain accounting: pwrdm_pre_transition(), pwrdm_post_transition()
>> >>>>
>> >>>> Could you try using omap_sram_idle() and just commenting out those
>> >>>> calls? Does that help performance? Those iterate over all the
>> >>>> powerdomains, so they definitely add some overhead, but I don't think it
>> >>>> would be as significant as what you're seeing.
>> >>>
>> >>> Seems to be taking a good part of it.
>> >>>
>> >>>> Much more likely is...
>> >>>>
>> >>>> 2) jump to SRAM, SDRC self-refresh, SDRC errata workarounds
>> >>>
>> >>> Could not notice any difference.
>> >>>
>> >>> To me it looks like this results from many small things adding up.
>> >>> Idle is called so often that pwrdm_p*_transition() and those
>> >>> pwrdm_for_each_clkdm() walks start slowing everything down, perhaps
>> >>> because they access lots of registers on slow buses?
>> >
>> > From the list of contributors, the main ones are:
>> > (140us) pwrdm_pre_transition and pwrdm_post_transition,
>>
>> I have observed this one on OMAP4 too. There was a plan to remove
>> this as part of Tero's PD/CD use-counting series.
>
> pwrdm_pre/post transitions could be optimized a bit already now. They
> should only need to be called for the mpu, core and per domains, but
> currently they scan through everything.
>
>> > (105us) omap2_gpio_prepare_for_idle and
>> > omap2_gpio_resume_after_idle. This could be avoided if PER stays ON in
>> > the latency-critical C-states,
>>
>> Yes. In C1, when you deny idle for per, there should be no need to
>> call this. But even in the case when it is called, why is it taking
>> 105us? That needs further digging.
>>
>> > (78us) pwrdm_for_each_clkdm(mpu, core, deny_idle/allow_idle),
>>
>> Depending on the OPP, a PRCM read can take up to ~12-14us, so the
>> above shouldn't be surprising.
>>
>> > (33us estimated) omap_set_pwrdm_state(mpu, core, neon),
>>
>> This is again dominated by PRCM reads.
>>
>> > (11us) clkdm_allow_idle(mpu). Is this needed?
>> >
>> I guess yes, otherwise when C2+ is attempted the MPU CD can't idle.
>>
>> > Here are a few questions and suggestions:
>> > - In case of latency-critical C-states, could the high-latency code be
>> > bypassed in favor of a much simpler version? Pushing the concept a bit
>> > farther, one could have a C1 state that just relaxes the cpu (no WFI),
>> > a C2 state which bypasses a lot of code in __omap3_enter_idle, and the
>> > rest of the C-states as we have today,
>>
>> We should do that. In fact the C1 state should be as light as
>> possible, like a plain WFI or so.
>>
>> > - Is it needed to iterate through all the power and clock domains in
>> > order to keep them active?
>>
>> That iteration should be removed.
>>
>> > - Trying to idle some unrelated power domains (e.g. PER) causes a
>> > performance hit. How to link all the power domain states to the
>> > cpuidle C-state? The per-device PM QoS framework could be used to
>> > constrain some power domains, but this is highly dependent on the use
>> > case.
>> >
>> Note that just limiting the PER PD state to ON is not going to
>> remove the penalty. You need to avoid the PER CD transition and
>> hence deny idle. I remember the Nokia team did this on some
>> products.
>
> The N9 kernel (which is available at
> http://harmattan-dev.nokia.com/pool/harmattan/free/k/kernel/) contained
> a lot of optimizations in the idle path. Maybe someone should take a
> look at it at some point.

Ok, thanks for the link.
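As promised above, here is the kind of instrumentation I used, as a
minimal sketch. One assumption up front: the actual measurements used
custom tracepoints, which are not shown in this thread; trace_printk()
is used here only as the simplest low-overhead stand-in.

/*
 * Instrumentation sketch only (untested). Each trace_printk() call
 * lands a timestamped entry in the ftrace ring buffer, so the
 * per-section latencies can be read as deltas in
 * /sys/kernel/debug/tracing/trace.
 */
#include <linux/kernel.h>	/* trace_printk() */
#include "powerdomain.h"	/* pwrdm_pre_transition() */

static void omap_sram_idle_traced(void)
{
	trace_printk("enter omap_sram_idle\n");

	pwrdm_pre_transition();
	trace_printk("pwrdm_pre_transition done\n");

	/*
	 * ... instrument the remaining omap_sram_idle() sections the
	 * same way, down to pwrdm_post_transition() ...
	 */

	trace_printk("exit omap_sram_idle\n");
}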
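Going back to the "lite C1" and deny-idle suggestions above, here is a
minimal, untested sketch of what that could look like in
cpuidle34xx.c. The dedicated C1 handler (instead of a check inside
__omap3_enter_idle) and the one-time per_clkdm lookup are my
assumptions:

#include <linux/cpuidle.h>
#include <asm/proc-fns.h>	/* cpu_do_idle() */
#include "clockdomain.h"	/* clkdm_{lookup,deny_idle,allow_idle}() */

/* Looked up once at init: per_clkdm = clkdm_lookup("per_clkdm"); */
static struct clockdomain *per_clkdm;

static int omap3_enter_idle_lite_c1(struct cpuidle_device *dev,
				    struct cpuidle_driver *drv,
				    int index)
{
	/*
	 * Keep PER out of the transition entirely: no
	 * omap2_gpio_prepare_for_idle(), no PER context handling,
	 * no PER power-state reads.
	 */
	clkdm_deny_idle(per_clkdm);

	/*
	 * Plain WFI instead of omap_sram_idle(): skips the
	 * pwrdm_pre/post_transition() scans over every powerdomain,
	 * the pwrdm_for_each_clkdm() walks and the SRAM jump.
	 */
	cpu_do_idle();

	clkdm_allow_idle(per_clkdm);

	return index;
}

This trades PER ever reaching RET from C1 for the software overhead
measured above; it should behave roughly like the [2]+[4]+[5]
combination from Grazvydas' table.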
>
>> >> Yes, PRCM register accesses are unfortunately rather slow, and we've
>> >> known that for some time, but haven't done any detailed analysis of the
>> >> overhead.
>> >
>> > That analysis would be worth doing. A lot of read accesses to the
>> > current, next and previous power states are performed in the idle
>> > code.
>> >
>> >> Using the function_graph tracer, I was able to see that the pre/post
>> >> transitions are taking an enormous amount of time:
>> >>
>> >> - pwrdm pre-transition: 1400+ us at 600MHz (4000+ us at 125MHz)
>> >> - pwrdm post-transition: 1600+ us at 600MHz (6000+ us at 125MHz)
>> >>
>> >> Notice the big difference between the 600MHz and 125MHz OPPs. Are you
>> >> using CPUfreq at all in your tests? If using cpufreq + the ondemand
>> >> governor, you're probably running at a low OPP due to lack of CPU
>> >> activity, which will also affect the latencies in the idle path.
>> >>
>> >>> Maybe some register cache would help us there, or are those registers
>> >>> expected to be changed by hardware often?
>> >>
>> >> Yes, we've known that some sort of register cache here would be useful
>> >> for some time, but haven't got to implementing it.
>> >
>> > I can try some proof-of-concept code, just to prove its usefulness.
>> >
>> Please do so. We were hoping that after Tero's series we won't need
>> this pre/post stuff, but I am not sure if Tero is addressing that.
>>
>> The register cache initiative is most welcome.
>>
>> Regards,
>> Santosh

Regards,
Jean
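P.S. As a very first cut at the register cache proof of concept,
something like the sketch below is what I have in mind. Everything
here is hypothetical (names, sizing, the direct-mapped layout, and no
locking is shown), and any register the hardware updates on its own,
e.g. the previous-power-state registers, must be invalidated before
being read again after wakeup:

/*
 * Write-through, direct-mapped read cache for PRCM registers
 * (proof-of-concept sketch only, untested).
 */
#include <linux/io.h>
#include <linux/types.h>

#define PRCM_CACHE_SIZE 64

struct prcm_cache_entry {
	void __iomem *addr;
	u32 val;
	bool valid;
};

static struct prcm_cache_entry prcm_cache[PRCM_CACHE_SIZE];

static inline struct prcm_cache_entry *prcm_cache_slot(void __iomem *addr)
{
	return &prcm_cache[((unsigned long)addr >> 2) % PRCM_CACHE_SIZE];
}

static u32 prcm_read_cached(void __iomem *addr)
{
	struct prcm_cache_entry *e = prcm_cache_slot(addr);

	if (e->valid && e->addr == addr)
		return e->val;		/* hit: no slow interconnect access */

	e->addr = addr;
	e->val = __raw_readl(addr);	/* miss: one real PRCM read */
	e->valid = true;
	return e->val;
}

static void prcm_write_cached(u32 val, void __iomem *addr)
{
	struct prcm_cache_entry *e = prcm_cache_slot(addr);

	__raw_writel(val, addr);	/* write-through: HW stays authoritative */
	e->addr = addr;
	e->val = val;
	e->valid = true;
}

/* For registers the hardware may have changed behind our back. */
static void prcm_cache_invalidate(void __iomem *addr)
{
	struct prcm_cache_entry *e = prcm_cache_slot(addr);

	if (e->addr == addr)
		e->valid = false;
}

If, as the traces suggest, most idle-path accesses are reads of
software-written registers (next power state, clockdomain control),
even this naive version should save most of the ~12-14us-per-access
PRCM reads mentioned above.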