Re: [PATCH] libfdt: Remove special handling for unaligned reads

David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> · Sun, 2 Feb 2020 17:20:31 +1100

On Fri, Jan 31, 2020 at 08:44:41AM +0000, Patrice CHOTARD wrote:
> Hi
> 
> On 1/31/20 4:38 AM, David Gibson wrote:
> > On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote:
> >> On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote:
> >>> Hi Rob, Tom, David
> >>>
> >>> On 1/28/20 3:08 PM, Rob Herring wrote:
> >>>> On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote:
> >>>>> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote:
> >>>>>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote:
> >>>>>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote:
> >>>>>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote:
> >>>>>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote:
> >>>>>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote:
> > [snip]
> >>>> Why not just fix the helpers for the aligned case and be done with it:
> >>>>
> >>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
> >>>> {
> >>>>        const uint8_t *bp = (const uint8_t *)p;
> >>>>
> >>>> +       if (!((unsigned long)p & 0x3))
> >>>> +               return fdt32_to_cpu(*p);
> >>>> +
> >>>>        return ((uint32_t)bp[0] << 24)
> >>>>                | ((uint32_t)bp[1] << 16)
> >>>>                | ((uint32_t)bp[2] << 8)
> >>>>                | bp[3];
> >>>> }
> >>>
> >>> Here are the results, before and after your proposed patch:
> >>>
> >>> tested tag: V2020.04-RC1
> >>>
> >>> STM32MP> bootstage report
> >>> Timer summary in microseconds (12 records):
> >>>        Mark    Elapsed  Stage
> >>>           0          0  reset
> >>>      84,268     84,268  SPL
> >>>     962,921    878,653  end SPL
> >>>     965,800      2,879  board_init_f
> >>>   4,314,348  3,348,548  board_init_r
> >>>   4,863,763    549,415  id=64
> >>>   4,908,759     44,996  id=65
> >>>   4,909,459        700  main_loop
> >>>   5,322,309    412,850  id=175
> >>>
> >>> Accumulated time:
> >>>                 83,284  dm_r
> >>>                 95,842  dm_spl
> >>>              1,502,020  dm_f
> >>>
> >>>
> >>> -------------------------------------------------------------------------------
> >>>
> >>> tested tag : V2020.04-RC1 + following fdt32_ld patch :
> >>>
> >>> static inline uint32_t fdt32_ld(const fdt32_t *p)
> >>> {
> >>>        const uint8_t *bp = (const uint8_t *)p;
> >>>
> >>> +       if (!((unsigned long)p & 0x3))
> >>> +               return fdt32_to_cpu(*p);
> >>> +
> >>>        return ((uint32_t)bp[0] << 24)
> >>>                | ((uint32_t)bp[1] << 16)
> >>>                | ((uint32_t)bp[2] << 8)
> >>>                | bp[3];
> >>> }
> >>>
> >>>
> >>> STM32MP> bootstage report
> >>> Timer summary in microseconds (12 records):
> >>>        Mark    Elapsed  Stage
> >>>           0          0  reset
> >>>      84,264     84,264  SPL
> >>>     959,300    875,036  end SPL
> >>>     962,192      2,892  board_init_f
> >>>   4,310,598  3,348,406  board_init_r
> >>>   4,860,074    549,476  id=64
> >>>   4,905,119     45,045  id=65
> >>>   4,905,819        700  main_loop
> >>>   5,098,636    192,817  id=175
> >>>
> >>> Accumulated time:
> >>>                 83,202  dm_r
> >>>                 95,252  dm_spl
> >>>              1,501,950  dm_f
> >>>
> >>>
> >>> There is no gain in board_init_r(), the added alignment test is expensive itself.
> > Drat.
> >
> >> That's odd. It should be just an AND and a conditional branch added.
> >> Those should be a few cycles compared to tens to hundreds of cycles
> >> for 4 loads instead of 1.
> > The later loads are very likely to hit in cache though, right?  Or is
> > this done before the caches are activated?
> >
> >> Doesn't u-boot build with -Os typically? Maybe it's not actually
> >> inlining.
> > Apparently it is.  And yet... it seems pretty suspicious that the
> > numbers are so close, despite doing something quite different here.
> > Can we somehow verify that the fast path is really being executed?
> >
> > For comparison purposes, what are the numbers if we change these to
> > unconditionally do an aligned load?
> >
> I made one more test to be 100% sure and the result is weird .... 
> 
> static inline uint32_t fdt32_ld(const fdt32_t *p)
> {
> 
>        if (!((unsigned long)p & 0x3))
>              return fdt32_to_cpu(*p);
> 
>        while (1) {};
> }
> 
> the results are back to 'normal', board_init_r() execution is 1.9 second  faster .....

Huh.  I dunno how, but it definitely seems like the conditional
version wasn't operating properly.

Does putting a "likely" branch prediction on that if make any difference?

> STM32MP> bootstage report
> Timer summary in microseconds (12 records):
>        Mark    Elapsed  Stage
>           0          0  reset
>      84,265     84,265  SPL
>     868,330    784,065  end SPL
>     871,163      2,833  board_init_f
>   2,360,135  1,488,972  board_init_r
>   2,786,920    426,785  id=64
>   2,818,189     31,269  id=65
>   2,818,889        700  main_loop
>   2,877,661     58,772  id=175
> 
> Accumulated time:
>                 54,489  dm_r
>                 56,778  dm_spl
>                644,883  dm_f
> 
> Is the compiler doing some optimization link to the CPU pipeline prediction or something else ?
> 
> Patrice

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Attachment:
signature.asc

Description: PGP signature