Hi David On 2/2/20 7:20 AM, David Gibson wrote: > On Fri, Jan 31, 2020 at 08:44:41AM +0000, Patrice CHOTARD wrote: >> Hi >> >> On 1/31/20 4:38 AM, David Gibson wrote: >>> On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote: >>>> On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote: >>>>> Hi Rob, Tom, David >>>>> >>>>> On 1/28/20 3:08 PM, Rob Herring wrote: >>>>>> On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote: >>>>>>> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote: >>>>>>>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote: >>>>>>>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote: >>>>>>>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote: >>>>>>>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote: >>>>>>>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote: >>> [snip] >>>>>> Why not just fix the helpers for the aligned case and be done with it: >>>>>> >>>>>> static inline uint32_t fdt32_ld(const fdt32_t *p) >>>>>> { >>>>>> const uint8_t *bp = (const uint8_t *)p; >>>>>> >>>>>> + if (!((unsigned long)p & 0x3)) >>>>>> + return fdt32_to_cpu(*p); >>>>>> + >>>>>> return ((uint32_t)bp[0] << 24) >>>>>> | ((uint32_t)bp[1] << 16) >>>>>> | ((uint32_t)bp[2] << 8) >>>>>> | bp[3]; >>>>>> } >>>>> Here are the results, before and after your proposed patch: >>>>> >>>>> tested tag: V2020.04-RC1 >>>>> >>>>> STM32MP> bootstage report >>>>> Timer summary in microseconds (12 records): >>>>> Mark Elapsed Stage >>>>> 0 0 reset >>>>> 84,268 84,268 SPL >>>>> 962,921 878,653 end SPL >>>>> 965,800 2,879 board_init_f >>>>> 4,314,348 3,348,548 board_init_r >>>>> 4,863,763 549,415 id=64 >>>>> 4,908,759 44,996 id=65 >>>>> 4,909,459 700 main_loop >>>>> 5,322,309 412,850 id=175 >>>>> >>>>> Accumulated time: >>>>> 83,284 dm_r >>>>> 95,842 dm_spl >>>>> 1,502,020 dm_f >>>>> >>>>> >>>>> ------------------------------------------------------------------------------- >>>>> >>>>> tested tag : V2020.04-RC1 + following fdt32_ld patch : >>>>> >>>>> static inline uint32_t fdt32_ld(const fdt32_t *p) >>>>> { >>>>> const uint8_t *bp = (const uint8_t *)p; >>>>> >>>>> + if (!((unsigned long)p & 0x3)) >>>>> + return fdt32_to_cpu(*p); >>>>> + >>>>> return ((uint32_t)bp[0] << 24) >>>>> | ((uint32_t)bp[1] << 16) >>>>> | ((uint32_t)bp[2] << 8) >>>>> | bp[3]; >>>>> } >>>>> >>>>> >>>>> STM32MP> bootstage report >>>>> Timer summary in microseconds (12 records): >>>>> Mark Elapsed Stage >>>>> 0 0 reset >>>>> 84,264 84,264 SPL >>>>> 959,300 875,036 end SPL >>>>> 962,192 2,892 board_init_f >>>>> 4,310,598 3,348,406 board_init_r >>>>> 4,860,074 549,476 id=64 >>>>> 4,905,119 45,045 id=65 >>>>> 4,905,819 700 main_loop >>>>> 5,098,636 192,817 id=175 >>>>> >>>>> Accumulated time: >>>>> 83,202 dm_r >>>>> 95,252 dm_spl >>>>> 1,501,950 dm_f >>>>> >>>>> >>>>> There is no gain in board_init_r(), the added alignment test is expensive itself. >>> Drat. >>> >>>> That's odd. It should be just an AND and a conditional branch added. >>>> Those should be a few cycles compared to tens to hundreds of cycles >>>> for 4 loads instead of 1. >>> The later loads are very likely to hit in cache though, right? Or is >>> this done before the caches are activated? >>> >>>> Doesn't u-boot build with -Os typically? Maybe it's not actually >>>> inlining. >>> Apparently it is. And yet... it seems pretty suspicious that the >>> numbers are so close, despite doing something quite different here. >>> Can we somehow verify that the fast path is really being executed? >>> >>> For comparison purposes, what are the numbers if we change these to >>> unconditionally do an aligned load? >>> >> I made one more test to be 100% sure and the result is weird .... >> >> static inline uint32_t fdt32_ld(const fdt32_t *p) >> { >> >> if (!((unsigned long)p & 0x3)) >> return fdt32_to_cpu(*p); >> >> while (1) {}; >> } >> >> the results are back to 'normal', board_init_r() execution is 1.9 second faster ..... > Huh. I dunno how, but it definitely seems like the conditional > version wasn't operating properly. > > Does putting a "likely" branch prediction on that if make any difference? I already did this test by adding a "likely" but it has no effect :-( Patrice > >> STM32MP> bootstage report >> Timer summary in microseconds (12 records): >> Mark Elapsed Stage >> 0 0 reset >> 84,265 84,265 SPL >> 868,330 784,065 end SPL >> 871,163 2,833 board_init_f >> 2,360,135 1,488,972 board_init_r >> 2,786,920 426,785 id=64 >> 2,818,189 31,269 id=65 >> 2,818,889 700 main_loop >> 2,877,661 58,772 id=175 >> >> Accumulated time: >> 54,489 dm_r >> 56,778 dm_spl >> 644,883 dm_f >> >> Is the compiler doing some optimization link to the CPU pipeline prediction or something else ? >> >> Patrice