Re: [PATCH] libfdt: Remove special handling for unaligned reads

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]



On Mon, Feb 03, 2020 at 01:14:15PM +0000, Patrice CHOTARD wrote:
> Hi David
> 
> On 2/2/20 7:20 AM, David Gibson wrote:
> > On Fri, Jan 31, 2020 at 08:44:41AM +0000, Patrice CHOTARD wrote:
> >> Hi
> >>
> >> On 1/31/20 4:38 AM, David Gibson wrote:
> >>> On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote:
> >>>> On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote:
> >>>>> Hi Rob, Tom, David
> >>>>>
> >>>>> On 1/28/20 3:08 PM, Rob Herring wrote:
> >>>>>> On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote:
> >>>>>>> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote:
> >>>>>>>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote:
> >>>>>>>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote:
> >>>>>>>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote:
> >>>>>>>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote:
> >>>>>>>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote:
> >>> [snip]
> >>>>>> Why not just fix the helpers for the aligned case and be done with it:
> >>>>>>
> >>>>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
> >>>>>> {
> >>>>>>        const uint8_t *bp = (const uint8_t *)p;
> >>>>>>
> >>>>>> +       if (!((unsigned long)p & 0x3))
> >>>>>> +               return fdt32_to_cpu(*p);
> >>>>>> +
> >>>>>>        return ((uint32_t)bp[0] << 24)
> >>>>>>                | ((uint32_t)bp[1] << 16)
> >>>>>>                | ((uint32_t)bp[2] << 8)
> >>>>>>                | bp[3];
> >>>>>> }
> >>>>> Here are the results, before and after your proposed patch:
> >>>>>
> >>>>> tested tag: V2020.04-RC1
> >>>>>
> >>>>> STM32MP> bootstage report
> >>>>> Timer summary in microseconds (12 records):
> >>>>>        Mark    Elapsed  Stage
> >>>>>           0          0  reset
> >>>>>      84,268     84,268  SPL
> >>>>>     962,921    878,653  end SPL
> >>>>>     965,800      2,879  board_init_f
> >>>>>   4,314,348  3,348,548  board_init_r
> >>>>>   4,863,763    549,415  id=64
> >>>>>   4,908,759     44,996  id=65
> >>>>>   4,909,459        700  main_loop
> >>>>>   5,322,309    412,850  id=175
> >>>>>
> >>>>> Accumulated time:
> >>>>>                 83,284  dm_r
> >>>>>                 95,842  dm_spl
> >>>>>              1,502,020  dm_f
> >>>>>
> >>>>>
> >>>>> -------------------------------------------------------------------------------
> >>>>>
> >>>>> tested tag : V2020.04-RC1 + following fdt32_ld patch :
> >>>>>
> >>>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
> >>>>> {
> >>>>>        const uint8_t *bp = (const uint8_t *)p;
> >>>>>
> >>>>> +       if (!((unsigned long)p & 0x3))
> >>>>> +               return fdt32_to_cpu(*p);
> >>>>> +
> >>>>>        return ((uint32_t)bp[0] << 24)
> >>>>>                | ((uint32_t)bp[1] << 16)
> >>>>>                | ((uint32_t)bp[2] << 8)
> >>>>>                | bp[3];
> >>>>> }
> >>>>>
> >>>>>
> >>>>> STM32MP> bootstage report
> >>>>> Timer summary in microseconds (12 records):
> >>>>>        Mark    Elapsed  Stage
> >>>>>           0          0  reset
> >>>>>      84,264     84,264  SPL
> >>>>>     959,300    875,036  end SPL
> >>>>>     962,192      2,892  board_init_f
> >>>>>   4,310,598  3,348,406  board_init_r
> >>>>>   4,860,074    549,476  id=64
> >>>>>   4,905,119     45,045  id=65
> >>>>>   4,905,819        700  main_loop
> >>>>>   5,098,636    192,817  id=175
> >>>>>
> >>>>> Accumulated time:
> >>>>>                 83,202  dm_r
> >>>>>                 95,252  dm_spl
> >>>>>              1,501,950  dm_f
> >>>>>
> >>>>>
> >>>>> There is no gain in board_init_r(), the added alignment test is expensive itself.
> >>> Drat.
> >>>
> >>>> That's odd. It should be just an AND and a conditional branch added.
> >>>> Those should be a few cycles compared to tens to hundreds of cycles
> >>>> for 4 loads instead of 1.
> >>> The later loads are very likely to hit in cache though, right?  Or is
> >>> this done before the caches are activated?
> >>>
> >>>> Doesn't u-boot build with -Os typically? Maybe it's not actually
> >>>> inlining.
> >>> Apparently it is.  And yet... it seems pretty suspicious that the
> >>> numbers are so close, despite doing something quite different here.
> >>> Can we somehow verify that the fast path is really being executed?
> >>>
> >>> For comparison purposes, what are the numbers if we change these to
> >>> unconditionally do an aligned load?
> >>>
> >> I made one more test to be 100% sure and the result is weird .... 
> >>
> >> static inline uint32_t fdt32_ld(const fdt32_t *p)
> >> {
> >>
> >>        if (!((unsigned long)p & 0x3))
> >>              return fdt32_to_cpu(*p);
> >>
> >>        while (1) {};
> >> }
> >>
> >> the results are back to 'normal', board_init_r() execution is 1.9 second  faster .....
> > Huh.  I dunno how, but it definitely seems like the conditional
> > version wasn't operating properly.
> >
> > Does putting a "likely" branch prediction on that if make any difference?
> 
> I already did this test by adding a "likely" but it has no effect
> :-(

Interesting.  I think someone who knows ARM asm needs to look at the
compiler output for the two cases to see what's going on.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Device Tree]     [Device Tree Spec]     [Linux Driver Backports]     [Video for Linux]     [Linux USB Devel]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [Yosemite Backpacking]

  Powered by Linux