Re: [PATCH] libfdt: Remove special handling for unaligned reads

Patrice CHOTARD <patrice.chotard@xxxxxx> · Mon, 3 Feb 2020 13:14:15 +0000



Hi David

On 2/2/20 7:20 AM, David Gibson wrote:
> On Fri, Jan 31, 2020 at 08:44:41AM +0000, Patrice CHOTARD wrote:
>> Hi
>>
>> On 1/31/20 4:38 AM, David Gibson wrote:
>>> On Thu, Jan 30, 2020 at 02:18:21PM -0600, Rob Herring wrote:
>>>> On Thu, Jan 30, 2020 at 8:44 AM Patrice CHOTARD <patrice.chotard@xxxxxx> wrote:
>>>>> Hi Rob, Tom, David
>>>>>
>>>>> On 1/28/20 3:08 PM, Rob Herring wrote:
>>>>>> On Tue, Jan 28, 2020 at 7:43 AM Tom Rini <trini@xxxxxxxxxxxx> wrote:
>>>>>>> On Tue, Jan 28, 2020 at 07:59:18PM +1100, David Gibson wrote:
>>>>>>>> On Mon, Jan 27, 2020 at 10:04:34AM -0500, Tom Rini wrote:
>>>>>>>>> On Mon, Jan 27, 2020 at 02:23:51PM +1100, David Gibson wrote:
>>>>>>>>>> On Thu, Jan 23, 2020 at 07:16:50AM -0500, Tom Rini wrote:
>>>>>>>>>>> On Thu, Jan 23, 2020 at 08:14:40PM +1100, David Gibson wrote:
>>>>>>>>>>>> On Fri, Jan 17, 2020 at 10:31:06AM -0500, Tom Rini wrote:
>>> [snip]
>>>>>> Why not just fix the helpers for the aligned case and be done with it:
>>>>>>
>>>>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
>>>>>> {
>>>>>>        const uint8_t *bp = (const uint8_t *)p;
>>>>>>
>>>>>> +       if (!((unsigned long)p & 0x3))
>>>>>> +               return fdt32_to_cpu(*p);
>>>>>> +
>>>>>>        return ((uint32_t)bp[0] << 24)
>>>>>>                | ((uint32_t)bp[1] << 16)
>>>>>>                | ((uint32_t)bp[2] << 8)
>>>>>>                | bp[3];
>>>>>> }
>>>>> Here are the results, before and after your proposed patch:
>>>>>
>>>>> tested tag: V2020.04-RC1
>>>>>
>>>>> STM32MP> bootstage report
>>>>> Timer summary in microseconds (12 records):
>>>>>        Mark    Elapsed  Stage
>>>>>           0          0  reset
>>>>>      84,268     84,268  SPL
>>>>>     962,921    878,653  end SPL
>>>>>     965,800      2,879  board_init_f
>>>>>   4,314,348  3,348,548  board_init_r
>>>>>   4,863,763    549,415  id=64
>>>>>   4,908,759     44,996  id=65
>>>>>   4,909,459        700  main_loop
>>>>>   5,322,309    412,850  id=175
>>>>>
>>>>> Accumulated time:
>>>>>                 83,284  dm_r
>>>>>                 95,842  dm_spl
>>>>>              1,502,020  dm_f
>>>>>
>>>>>
>>>>> -------------------------------------------------------------------------------
>>>>>
>>>>> tested tag : V2020.04-RC1 + following fdt32_ld patch :
>>>>>
>>>>> static inline uint32_t fdt32_ld(const fdt32_t *p)
>>>>> {
>>>>>        const uint8_t *bp = (const uint8_t *)p;
>>>>>
>>>>> +       if (!((unsigned long)p & 0x3))
>>>>> +               return fdt32_to_cpu(*p);
>>>>> +
>>>>>        return ((uint32_t)bp[0] << 24)
>>>>>                | ((uint32_t)bp[1] << 16)
>>>>>                | ((uint32_t)bp[2] << 8)
>>>>>                | bp[3];
>>>>> }
>>>>>
>>>>>
>>>>> STM32MP> bootstage report
>>>>> Timer summary in microseconds (12 records):
>>>>>        Mark    Elapsed  Stage
>>>>>           0          0  reset
>>>>>      84,264     84,264  SPL
>>>>>     959,300    875,036  end SPL
>>>>>     962,192      2,892  board_init_f
>>>>>   4,310,598  3,348,406  board_init_r
>>>>>   4,860,074    549,476  id=64
>>>>>   4,905,119     45,045  id=65
>>>>>   4,905,819        700  main_loop
>>>>>   5,098,636    192,817  id=175
>>>>>
>>>>> Accumulated time:
>>>>>                 83,202  dm_r
>>>>>                 95,252  dm_spl
>>>>>              1,501,950  dm_f
>>>>>
>>>>>
>>>>> There is no gain in board_init_r(), the added alignment test is expensive itself.
>>> Drat.
>>>
>>>> That's odd. It should be just an AND and a conditional branch added.
>>>> Those should be a few cycles compared to tens to hundreds of cycles
>>>> for 4 loads instead of 1.
>>> The later loads are very likely to hit in cache though, right?  Or is
>>> this done before the caches are activated?
>>>
>>>> Doesn't u-boot build with -Os typically? Maybe it's not actually
>>>> inlining.
>>> Apparently it is.  And yet... it seems pretty suspicious that the
>>> numbers are so close, despite doing something quite different here.
>>> Can we somehow verify that the fast path is really being executed?
>>>
>>> For comparison purposes, what are the numbers if we change these to
>>> unconditionally do an aligned load?
>>>
>> I made one more test to be 100% sure and the result is weird .... 
>>
>> static inline uint32_t fdt32_ld(const fdt32_t *p)
>> {
>>
>>        if (!((unsigned long)p & 0x3))
>>              return fdt32_to_cpu(*p);
>>
>>        while (1) {};
>> }
>>
>> the results are back to 'normal', board_init_r() execution is 1.9 second  faster .....
> Huh.  I dunno how, but it definitely seems like the conditional
> version wasn't operating properly.
>
> Does putting a "likely" branch prediction on that if make any difference?

I already did this test by adding a "likely" but it has no effect :-(

Patrice

>
>> STM32MP> bootstage report
>> Timer summary in microseconds (12 records):
>>        Mark    Elapsed  Stage
>>           0          0  reset
>>      84,265     84,265  SPL
>>     868,330    784,065  end SPL
>>     871,163      2,833  board_init_f
>>   2,360,135  1,488,972  board_init_r
>>   2,786,920    426,785  id=64
>>   2,818,189     31,269  id=65
>>   2,818,889        700  main_loop
>>   2,877,661     58,772  id=175
>>
>> Accumulated time:
>>                 54,489  dm_r
>>                 56,778  dm_spl
>>                644,883  dm_f
>>
>> Is the compiler doing some optimization link to the CPU pipeline prediction or something else ?
>>
>> Patrice