Re: [PATCH 4.14] mm, slub: restore the original intention of prefetch_freepointer()

On 4/27/20 10:45 AM, Vlastimil Babka wrote:
> On 4/27/20 9:01 AM, Sven Eckelmann wrote:
>> On Monday, 27 April 2020 01:14:26 CEST Sasha Levin wrote:
>>> On Sun, Apr 26, 2020 at 09:06:17AM +0200, Sven Eckelmann wrote:
>>>> From: Vlastimil Babka <vbabka@xxxxxxx>
>>>>
>>>> commit 0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5 upstream.
>> [...]
>>>> ---
>>>> The original problem is explained in the patch description as a
>>>> performance problem. That may also be one reason why it was
>>>> never submitted for a stable kernel.
>>>>
>>>> But tests on mips ath79 (OpenWrt ar71xx target) showed that it is most
>>>> likely related to "random" data bus errors. At least applying this patch
>>>> seemed to have solved it for Matthias Schiffer <mschiffer@xxxxxxxxxxxxxxxxxxxx>
>>>> and some other people who were debugging/testing this problem with him.
>>>>
>>>> More details about it can be found in
>>>> https://github.com/freifunk-gluon/gluon/issues/1982
> 
> Hmm, that doesn't explain much about how the fix was eventually found, but
> never mind, good job.

The fact that the location of the data bus error was so imprecise made me
suspect that no regular load could be the cause - therefore I looked at
that prefetch in detail and eventually found your patch.

> 
>>>
>>> Interesting... I wonder why this issue has started only now.
>>
>> Unfortunately, I don't know the details. So I (actually we) would love to get
>> some feedback from the slub experts, to make sure there isn't another problem
>> which we just don't grasp yet.
> 
> I think the prefetch may go to an address that would cause a real fetch to page
> fault. Under normal circumstances that could be only the NULL pointer that
> terminates a freelist, otherwise the address should be valid.

For further analysis, I just replaced the prefetch as implemented in 4.14
(i.e. before applying the patch in question) with a regular load (excluding
NULL, which would lead to an immediate fault on boot). With the test
program, I quickly hit a fault, at an address that looks completely bogus
(i.e. neither NULL nor an address looking like it might be mapped to
anything). Is this expected with the incorrect prefetch_freepointer()
implementation of 4.14? Is it possible that prefetch_freepointer() of 4.14
is even more broken than suspected before? Note that we hit this issue with
the "names_cache" slab, which has page-sized objects, if that might provide
any clue...
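
To make the distinction concrete, here is a toy userspace model of the two
possible prefetch targets. It is only a sketch and not the kernel code: all
toy_* names are made up for illustration, and __builtin_prefetch() (a
GCC/Clang builtin) stands in for the kernel's prefetch(). One variant
prefetches the slot inside the object that holds the free pointer; the other
first loads that slot and prefetches whatever value it finds there. In the
second variant the target is only as good as the data stored in the object,
so stale contents would yield an arbitrary address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cache: only the field used here. */
struct toy_cache {
	size_t offset;	/* byte offset of the free pointer inside a free object */
};

/* Address of the slot that stores the "next free object" pointer. */
static void *toy_freeptr_slot(const struct toy_cache *s, void *object)
{
	return (char *)object + s->offset;
}

/* Slot-address variant: no load is needed to compute the target. */
static void *toy_prefetch_slot(const struct toy_cache *s, void *object)
{
	void *target = toy_freeptr_slot(s, object);

	__builtin_prefetch(target);
	return target;
}

/* Dereferencing variant: a real load from the object picks the target. */
static void *toy_prefetch_deref(const struct toy_cache *s, void *object)
{
	void *target = *(void **)toy_freeptr_slot(s, object);

	__builtin_prefetch(target);
	return target;
}

/* Two free objects on a list: a -> b -> NULL. Returns 0 on success. */
static int toy_demo(void)
{
	static void *a[8], *b[8];
	struct toy_cache s = { .offset = 2 * sizeof(void *) };

	a[2] = b;	/* a's free pointer names b */
	b[2] = NULL;	/* b terminates the list */

	/* The slot variant always stays inside the object it was given. */
	assert(toy_prefetch_slot(&s, b) == (void *)&b[2]);
	/* The deref variant looks one link ahead: b for a, NULL for b. */
	assert(toy_prefetch_deref(&s, a) == (void *)b);
	assert(toy_prefetch_deref(&s, b) == NULL);
	return 0;
}
```

Calling toy_demo() exercises both variants on a well-formed list. The model
says nothing about how a bogus value could end up in the slot on real
hardware; it only shows why the dereferencing variant depends on the
object's contents while the slot variant does not.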

In any case, it seems like the "pref" instruction should not be used on
bogus addresses on the ath79 platform... The exact behaviour is also
hardware-dependent: On some SoCs, the error would be visible as a data bus
error, while others just reset without any way to find out what was going
wrong.

Matthias

> 
> So that could mean:
> 1) prefetch() on mips is implemented/compiled wrong?
> 2) the CPU really has issues with prefetch causing a page fault
> 3) the prefetch gets reordered between LL/SC and there's some bug similar to
> this one described in arch/mips/include/asm/sync.h:
> 
> /*
>  * Some Loongson 3 CPUs have a bug wherein execution of a memory access (load,
>  * store or prefetch) in between an LL & SC can cause the SC instruction to
>  * erroneously succeed, breaking atomicity. Whilst it's unusual to write code
>  * containing such sequences, this bug bites harder than we might otherwise
>  * expect due to reordering & speculation:
> 
> 
>> Just some background information about the "why" from freifunk-gluon's 
>> perspective:
>>
>> OpenWrt 19.07 was released (despite its name) at the beginning of 2020. And it 
>> was the first release using kernel 4.14 on the most used target: ar71xx 
>> (ath79). The wireless community network firmware projects (freifunk-gluon in
>> this example) updated their frameworks to this OpenWrt release in the last
>> few months and have only now started to roll it out on their networks.
>>
>> And while the wireless community networks around here usually don't track the 
>> connected clients, the health of the APs is often tracked on some central 
>> system. And some people then just noticed a sudden spike of reboots on their 
>> APs. Since ar71xx is (often) the most used architecture at the moment, this 
>> could be spotted rather easily if you spend some time looking at graphs.
>>
>> Kind regards,
>> 	Sven
>>
> 



