On Fri, Jan 31, 2020 at 08:44:41AM +0000, Patrice CHOTARD wrote: > Hi > > On 1/31/20 4:38 AM, David Gibson wrote: > > On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote: > >> On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote: > >>> Hi Rob, Tom, David > >>> > >>> On 1/28/20 3:08 PM, Rob Herring wrote: > >>>> On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote: > >>>>> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote: > >>>>>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote: > >>>>>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote: > >>>>>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote: > >>>>>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote: > >>>>>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote: > > [snip] > >>>> Why not just fix the helpers for the aligned case and be done with it: > >>>> > >>>> static inline uint32_t fdt32_ld(const fdt32_t *p) > >>>> { > >>>> const uint8_t *bp = (const uint8_t *)p; > >>>> > >>>> + if (!((unsigned long)p & 0x3)) > >>>> + return fdt32_to_cpu(*p); > >>>> + > >>>> return ((uint32_t)bp[0] << 24) > >>>> | ((uint32_t)bp[1] << 16) > >>>> | ((uint32_t)bp[2] << 8) > >>>> | bp[3]; > >>>> } > >>> > >>> Here are the results, before and after your proposed patch: > >>> > >>> tested tag: V2020.04-RC1 > >>> > >>> STM32MP> bootstage report > >>> Timer summary in microseconds (12 records): > >>> Mark Elapsed Stage > >>> 0 0 reset > >>> 84,268 84,268 SPL > >>> 962,921 878,653 end SPL > >>> 965,800 2,879 board_init_f > >>> 4,314,348 3,348,548 board_init_r > >>> 4,863,763 549,415 id=64 > >>> 4,908,759 44,996 id=65 > >>> 4,909,459 700 main_loop > >>> 5,322,309 412,850 id=175 > >>> > >>> Accumulated time: > >>> 83,284 dm_r > >>> 95,842 dm_spl > >>> 1,502,020 dm_f > >>> > >>> > >>> ------------------------------------------------------------------------------- > >>> > >>> tested tag : V2020.04-RC1 + following fdt32_ld patch : > >>> > >>> static inline uint32_t fdt32_ld(const fdt32_t *p) > >>> { > >>> const uint8_t *bp = (const uint8_t *)p; > >>> > >>> + if (!((unsigned long)p & 0x3)) > >>> + return fdt32_to_cpu(*p); > >>> + > >>> return ((uint32_t)bp[0] << 24) > >>> | ((uint32_t)bp[1] << 16) > >>> | ((uint32_t)bp[2] << 8) > >>> | bp[3]; > >>> } > >>> > >>> > >>> STM32MP> bootstage report > >>> Timer summary in microseconds (12 records): > >>> Mark Elapsed Stage > >>> 0 0 reset > >>> 84,264 84,264 SPL > >>> 959,300 875,036 end SPL > >>> 962,192 2,892 board_init_f > >>> 4,310,598 3,348,406 board_init_r > >>> 4,860,074 549,476 id=64 > >>> 4,905,119 45,045 id=65 > >>> 4,905,819 700 main_loop > >>> 5,098,636 192,817 id=175 > >>> > >>> Accumulated time: > >>> 83,202 dm_r > >>> 95,252 dm_spl > >>> 1,501,950 dm_f > >>> > >>> > >>> There is no gain in board_init_r(), the added alignment test is expensive itself. > > Drat. > > > >> That's odd. It should be just an AND and a conditional branch added. > >> Those should be a few cycles compared to tens to hundreds of cycles > >> for 4 loads instead of 1. > > The later loads are very likely to hit in cache though, right? Or is > > this done before the caches are activated? > > > >> Doesn't u-boot build with -Os typically? Maybe it's not actually > >> inlining. > > Apparently it is. And yet... it seems pretty suspicious that the > > numbers are so close, despite doing something quite different here. > > Can we somehow verify that the fast path is really being executed? > > > > For comparison purposes, what are the numbers if we change these to > > unconditionally do an aligned load? > > > I made one more test to be 100% sure and the result is weird .... > > static inline uint32_t fdt32_ld(const fdt32_t *p) > { > > if (!((unsigned long)p & 0x3)) > return fdt32_to_cpu(*p); > > while (1) {}; > } > > the results are back to 'normal', board_init_r() execution is 1.9 second faster ..... Huh. I dunno how, but it definitely seems like the conditional version wasn't operating properly. Does putting a "likely" branch prediction on that if make any difference? > STM32MP> bootstage report > Timer summary in microseconds (12 records): > Mark Elapsed Stage > 0 0 reset > 84,265 84,265 SPL > 868,330 784,065 end SPL > 871,163 2,833 board_init_f > 2,360,135 1,488,972 board_init_r > 2,786,920 426,785 id=64 > 2,818,189 31,269 id=65 > 2,818,889 700 main_loop > 2,877,661 58,772 id=175 > > Accumulated time: > 54,489 dm_r > 56,778 dm_spl > 644,883 dm_f > > Is the compiler doing some optimization link to the CPU pipeline prediction or something else ? > > Patrice -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Attachment:
signature.asc
Description: PGP signature