On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote: > On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote: > > > > Hi Rob, Tom, David > > > > On 1/28/20 3:08 PM, Rob Herring wrote: > > > On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote: > > >> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote: > > >>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote: > > >>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote: > > >>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote: > > >>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote: > > >>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote: [snip] > > > Why not just fix the helpers for the aligned case and be done with it: > > > > > > static inline uint32_t fdt32_ld(const fdt32_t *p) > > > { > > > const uint8_t *bp = (const uint8_t *)p; > > > > > > + if (!((unsigned long)p & 0x3)) > > > + return fdt32_to_cpu(*p); > > > + > > > return ((uint32_t)bp[0] << 24) > > > | ((uint32_t)bp[1] << 16) > > > | ((uint32_t)bp[2] << 8) > > > | bp[3]; > > > } > > > > > > Here are the results, before and after your proposed patch: > > > > tested tag: V2020.04-RC1 > > > > STM32MP> bootstage report > > Timer summary in microseconds (12 records): > > Mark Elapsed Stage > > 0 0 reset > > 84,268 84,268 SPL > > 962,921 878,653 end SPL > > 965,800 2,879 board_init_f > > 4,314,348 3,348,548 board_init_r > > 4,863,763 549,415 id=64 > > 4,908,759 44,996 id=65 > > 4,909,459 700 main_loop > > 5,322,309 412,850 id=175 > > > > Accumulated time: > > 83,284 dm_r > > 95,842 dm_spl > > 1,502,020 dm_f > > > > > > ------------------------------------------------------------------------------- > > > > tested tag : V2020.04-RC1 + following fdt32_ld patch : > > > > static inline uint32_t fdt32_ld(const fdt32_t *p) > > { > > const uint8_t *bp = (const uint8_t *)p; > > > > + if (!((unsigned long)p & 0x3)) > > + return fdt32_to_cpu(*p); > > + > > return ((uint32_t)bp[0] << 24) > > | ((uint32_t)bp[1] << 16) > > | ((uint32_t)bp[2] << 8) > > | bp[3]; > > } > > > > > > STM32MP> bootstage report > > Timer summary in microseconds (12 records): > > Mark Elapsed Stage > > 0 0 reset > > 84,264 84,264 SPL > > 959,300 875,036 end SPL > > 962,192 2,892 board_init_f > > 4,310,598 3,348,406 board_init_r > > 4,860,074 549,476 id=64 > > 4,905,119 45,045 id=65 > > 4,905,819 700 main_loop > > 5,098,636 192,817 id=175 > > > > Accumulated time: > > 83,202 dm_r > > 95,252 dm_spl > > 1,501,950 dm_f > > > > > > There is no gain in board_init_r(), the added alignment test is expensive itself. Drat. > That's odd. It should be just an AND and a conditional branch added. > Those should be a few cycles compared to tens to hundreds of cycles > for 4 loads instead of 1. The later loads are very likely to hit in cache though, right? Or is this done before the caches are activated? > Doesn't u-boot build with -Os typically? Maybe it's not actually > inlining. Apparently it is. And yet... it seems pretty suspicious that the numbers are so close, despite doing something quite different here. Can we somehow verify that the fast path is really being executed? For comparison purposes, what are the numbers if we change these to unconditionally do an aligned load? -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Attachment:
signature.asc
Description: PGP signature