On Fri, Jan 31, 2020 at 02:38:10PM +1100, David Gibson wrote: > On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote: > > On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote: > > > > > > Hi Rob, Tom, David > > > > > > On 1/28/20 3:08 PM, Rob Herring wrote: > > > > On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote: > > > >> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote: > > > >>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote: > > > >>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote: > > > >>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote: > > > >>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote: > > > >>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote: > [snip] > > > > Why not just fix the helpers for the aligned case and be done with it: > > > > > > > > static inline uint32_t fdt32_ld(const fdt32_t *p) > > > > { > > > > const uint8_t *bp = (const uint8_t *)p; > > > > > > > > + if (!((unsigned long)p & 0x3)) > > > > + return fdt32_to_cpu(*p); > > > > + > > > > return ((uint32_t)bp[0] << 24) > > > > | ((uint32_t)bp[1] << 16) > > > > | ((uint32_t)bp[2] << 8) > > > > | bp[3]; > > > > } > > > > > > > > > Here are the results, before and after your proposed patch: > > > > > > tested tag: V2020.04-RC1 > > > > > > STM32MP> bootstage report > > > Timer summary in microseconds (12 records): > > > Mark Elapsed Stage > > > 0 0 reset > > > 84,268 84,268 SPL > > > 962,921 878,653 end SPL > > > 965,800 2,879 board_init_f > > > 4,314,348 3,348,548 board_init_r > > > 4,863,763 549,415 id=64 > > > 4,908,759 44,996 id=65 > > > 4,909,459 700 main_loop > > > 5,322,309 412,850 id=175 > > > > > > Accumulated time: > > > 83,284 dm_r > > > 95,842 dm_spl > > > 1,502,020 dm_f > > > > > > > > > ------------------------------------------------------------------------------- > > > > > > tested tag : V2020.04-RC1 + following fdt32_ld patch : > > > > > > static inline uint32_t fdt32_ld(const fdt32_t *p) > > > { > > > const uint8_t *bp = (const uint8_t *)p; > > > > > > + if (!((unsigned long)p & 0x3)) > > > + return fdt32_to_cpu(*p); > > > + > > > return ((uint32_t)bp[0] << 24) > > > | ((uint32_t)bp[1] << 16) > > > | ((uint32_t)bp[2] << 8) > > > | bp[3]; > > > } > > > > > > > > > STM32MP> bootstage report > > > Timer summary in microseconds (12 records): > > > Mark Elapsed Stage > > > 0 0 reset > > > 84,264 84,264 SPL > > > 959,300 875,036 end SPL > > > 962,192 2,892 board_init_f > > > 4,310,598 3,348,406 board_init_r > > > 4,860,074 549,476 id=64 > > > 4,905,119 45,045 id=65 > > > 4,905,819 700 main_loop > > > 5,098,636 192,817 id=175 > > > > > > Accumulated time: > > > 83,202 dm_r > > > 95,252 dm_spl > > > 1,501,950 dm_f > > > > > > > > > There is no gain in board_init_r(), the added alignment test is expensive itself. > > Drat. > > > That's odd. It should be just an AND and a conditional branch added. > > Those should be a few cycles compared to tens to hundreds of cycles > > for 4 loads instead of 1. > > The later loads are very likely to hit in cache though, right? Or is > this done before the caches are activated? > > > Doesn't u-boot build with -Os typically? Maybe it's not actually > > inlining. > > Apparently it is. And yet... it seems pretty suspicious that the > numbers are so close, despite doing something quite different here. > Can we somehow verify that the fast path is really being executed? > > For comparison purposes, what are the numbers if we change these to > unconditionally do an aligned load? The before / after numbers can be seen at https://lists.denx.de/pipermail/u-boot/2020-January/396707.html -- Tom
Attachment:
signature.asc
Description: PGP signature