Hi

On 1/31/20 4:38 AM, David Gibson wrote:
> On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote:
>> On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote:
>>> Hi Rob, Tom, David
>>>
>>> On 1/28/20 3:08 PM, Rob Herring wrote:
>>>> On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote:
>>>>> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote:
>>>>>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote:
>>>>>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote:
>>>>>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote:
>>>>>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote:
>>>>>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote:
> [snip]
>>>> Why not just fix the helpers for the aligned case and be done with it:
>>>>
>>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
>>>> {
>>>>         const uint8_t *bp = (const uint8_t *)p;
>>>>
>>>> +       if (!((unsigned long)p & 0x3))
>>>> +               return fdt32_to_cpu(*p);
>>>> +
>>>>         return ((uint32_t)bp[0] << 24)
>>>>                | ((uint32_t)bp[1] << 16)
>>>>                | ((uint32_t)bp[2] << 8)
>>>>                | bp[3];
>>>> }
>>>
>>> Here are the results, before and after your proposed patch:
>>>
>>> tested tag: V2020.04-RC1
>>>
>>> STM32MP> bootstage report
>>> Timer summary in microseconds (12 records):
>>>        Mark    Elapsed  Stage
>>>           0          0  reset
>>>      84,268     84,268  SPL
>>>     962,921    878,653  end SPL
>>>     965,800      2,879  board_init_f
>>>   4,314,348  3,348,548  board_init_r
>>>   4,863,763    549,415  id=64
>>>   4,908,759     44,996  id=65
>>>   4,909,459        700  main_loop
>>>   5,322,309    412,850  id=175
>>>
>>> Accumulated time:
>>>                 83,284  dm_r
>>>                 95,842  dm_spl
>>>              1,502,020  dm_f
>>>
>>> -------------------------------------------------------------------------------
>>>
>>> tested tag: V2020.04-RC1 + following fdt32_ld patch:
>>>
>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
>>> {
>>>         const uint8_t *bp = (const uint8_t *)p;
>>>
>>> +       if (!((unsigned long)p & 0x3))
>>> +               return fdt32_to_cpu(*p);
>>> +
>>>         return ((uint32_t)bp[0] << 24)
>>>                | ((uint32_t)bp[1] << 16)
>>>                | ((uint32_t)bp[2] << 8)
>>>                | bp[3];
>>> }
>>>
>>> STM32MP> bootstage report
>>> Timer summary in microseconds (12 records):
>>>        Mark    Elapsed  Stage
>>>           0          0  reset
>>>      84,264     84,264  SPL
>>>     959,300    875,036  end SPL
>>>     962,192      2,892  board_init_f
>>>   4,310,598  3,348,406  board_init_r
>>>   4,860,074    549,476  id=64
>>>   4,905,119     45,045  id=65
>>>   4,905,819        700  main_loop
>>>   5,098,636    192,817  id=175
>>>
>>> Accumulated time:
>>>                 83,202  dm_r
>>>                 95,252  dm_spl
>>>              1,501,950  dm_f
>>>
>>> There is no gain in board_init_r(), the added alignment test is expensive itself.
> Drat.
>
>> That's odd. It should be just an AND and a conditional branch added.
>> Those should be a few cycles compared to tens to hundreds of cycles
>> for 4 loads instead of 1.
> The later loads are very likely to hit in cache though, right? Or is
> this done before the caches are activated?
>
>> Doesn't u-boot build with -Os typically? Maybe it's not actually
>> inlining.
> Apparently it is. And yet... it seems pretty suspicious that the
> numbers are so close, despite doing something quite different here.
> Can we somehow verify that the fast path is really being executed?
>
> For comparison purposes, what are the numbers if we change these to
> unconditionally do an aligned load?
>
I made one more test to be 100% sure and the result is weird:
static inline uint32_t fdt32_ld(const fdt32_t *p)
{
        if (!((unsigned long)p & 0x3))
                return fdt32_to_cpu(*p);

        /* Hang on purpose: if any caller ever passes an unaligned
         * pointer, the boot stalls here instead of silently taking
         * the byte-wise path. */
        while (1) {};
}

With this version the results are back to "normal": board_init_r() execution is 1.9 seconds faster.

STM32MP> bootstage report
Timer summary in microseconds (12 records):
       Mark    Elapsed  Stage
          0          0  reset
     84,265     84,265  SPL
    868,330    784,065  end SPL
    871,163      2,833  board_init_f
  2,360,135  1,488,972  board_init_r
  2,786,920    426,785  id=64
  2,818,189     31,269  id=65
  2,818,889        700  main_loop
  2,877,661     58,772  id=175

Accumulated time:
                54,489  dm_r
                56,778  dm_spl
               644,883  dm_f

Is the compiler doing some optimization linked to the CPU pipeline prediction, or is it something else?

Patrice
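
For completeness, this is how I read David's last question about "unconditionally do an aligned load" — an untested sketch, assuming every pointer passed to fdt32_ld() really is 4-byte aligned (which the while(1) experiment above suggests is the case on this board):

static inline uint32_t fdt32_ld(const fdt32_t *p)
{
        /* No alignment check, no byte-wise fallback: a single 32-bit
         * load plus byte swap. Only safe if every caller guarantees
         * 4-byte alignment. */
        return fdt32_to_cpu(*p);
}

That removes the branch entirely, so it should show the best case for the aligned path and tell us how much of the remaining cost is the check itself.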
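To answer the "can we verify the fast path is really being executed" question without hanging the board, a debug-only counter sketch could also work. The fdt32_ld_fast/fdt32_ld_slow names are made up for illustration, and since the helper is a static inline in a header, each C file including it would get its own copy of the counters:

static unsigned long fdt32_ld_fast;     /* aligned: single 32-bit load */
static unsigned long fdt32_ld_slow;     /* unaligned: byte-wise load */

static inline uint32_t fdt32_ld(const fdt32_t *p)
{
        const uint8_t *bp = (const uint8_t *)p;

        if (!((unsigned long)p & 0x3)) {
                fdt32_ld_fast++;
                return fdt32_to_cpu(*p);
        }

        fdt32_ld_slow++;
        return ((uint32_t)bp[0] << 24)
               | ((uint32_t)bp[1] << 16)
               | ((uint32_t)bp[2] << 8)
               | bp[3];
}

Printing the two counters after boot (or reading them with a debugger) would show how often each branch is actually taken.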