Re: [PATCH 4.14] mm, slub: restore the original intention of prefetch_freepointer()

On 4/27/20 9:01 AM, Sven Eckelmann wrote:
> On Monday, 27 April 2020 01:14:26 CEST Sasha Levin wrote:
>> On Sun, Apr 26, 2020 at 09:06:17AM +0200, Sven Eckelmann wrote:
>> >From: Vlastimil Babka <vbabka@xxxxxxx>
>> >
>> >commit 0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5 upstream.
> [...]
>> >---
>> >The original problem is explained in the patch description as a
>> >performance problem. And maybe this could also be one reason why it was
>> >never submitted for a stable kernel.
>> >
>> >But tests on mips ath79 (OpenWrt ar71xx target) showed that it is most likely
>> >related to "random" data bus errors. At least applying this patch seemed to
>> >have solved it for Matthias Schiffer <mschiffer@xxxxxxxxxxxxxxxxxxxx> and
>> >some other people who were debugging/testing this problem with him.
>> >
>> >More details about it can be found in
>> >https://github.com/freifunk-gluon/gluon/issues/1982

Hmm, that doesn't explain much about how the fix was eventually found, but
never mind, good job.

>> 
>> Interesting... I wonder why this issue has started only now.
> 
> Unfortunately, I don't know the details. So I (actually we) would love to get
> some feedback from the slub experts. We want to be sure that there isn't
> another problem which we just don't grasp yet.

I think the prefetch may go to an address that would cause a real fetch to page
fault. Under normal circumstances that could only be the NULL pointer that
terminates a freelist; otherwise the address should be valid.

So that could mean:
1) prefetch() on mips is implemented or compiled incorrectly?
2) the CPU really has issues with a prefetch to an address that would page fault
3) the prefetch gets reordered between LL/SC and there's some bug similar to
the one described in arch/mips/include/asm/sync.h:

/*
 * Some Loongson 3 CPUs have a bug wherein execution of a memory access (load,
 * store or prefetch) in between an LL & SC can cause the SC instruction to
 * erroneously succeed, breaking atomicity. Whilst it's unusual to write code
 * containing such sequences, this bug bites harder than we might otherwise
 * expect due to reordering & speculation:
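
To make the NULL case above concrete, here is a minimal userspace sketch of a
freelist pop that prefetches the free pointer of the next object. It is not the
kernel code: struct object, prefetch_freepointer_sketch(),
alloc_from_freelist() and demo_drain_freelist() are hypothetical names, the
free pointer is assumed to sit at offset 0, and __builtin_prefetch() (GCC/Clang)
stands in for the kernel's prefetch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a slab object; not the kernel's layout. */
struct object {
	struct object *next_free;	/* free pointer, assumed at offset 0 */
	char payload[56];
};

/*
 * Hint that the free pointer of 'object' will be read soon. Only the
 * address is computed; the object is never dereferenced here, so a NULL
 * 'object' (end of the freelist) is expected to be harmless as long as
 * the prefetch instruction itself does not fault.
 */
static void prefetch_freepointer_sketch(const struct object *object,
					size_t offset)
{
	__builtin_prefetch((const void *)((uintptr_t)object + offset));
}

/* Pop the first object and prefetch the free pointer of the new head. */
static struct object *alloc_from_freelist(struct object **freelist)
{
	struct object *object = *freelist;

	if (!object)
		return NULL;

	*freelist = object->next_free;
	prefetch_freepointer_sketch(*freelist,
				    offsetof(struct object, next_free));
	return object;
}

/* Build a three-object freelist, drain it, return the allocation count. */
int demo_drain_freelist(void)
{
	struct object pool[3];
	struct object *freelist = &pool[0];
	int count = 0;

	pool[0].next_free = &pool[1];
	pool[1].next_free = &pool[2];
	pool[2].next_free = NULL;	/* NULL terminates the freelist */

	while (alloc_from_freelist(&freelist))
		count++;

	assert(freelist == NULL);
	return count;
}
```

In this sketch the third allocation issues a prefetch on a NULL-based address,
which is the one case where the prefetch target would not be valid for a real
load. Calling demo_drain_freelist() from a main() should return 3 on a machine
where such a prefetch is a pure hint.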


> Just some background information about the "why" from freifunk-gluon's 
> perspective:
> 
> OpenWrt 19.07 was released (despite its name) at the beginning of 2020. And it 
> was the first release using kernel 4.14 on the most used target: ar71xx 
> (ath79). The wireless community network firmware projects (freifunk-gluon in 
> this example) updated their frameworks to this OpenWrt release in the last 
> months and just now started to roll it out on their networks.
> 
> And while the wireless community networks around here usually don't track the 
> connected clients, the health of the APs is often tracked on some central 
> system. And some people then just noticed a sudden spike of reboots on their 
> APs. Since ar71xx is (often) the most used architecture at the moment, this 
> could be spotted rather easily if you spend some time looking at graphs.
> 
> Kind regards,
> 	Sven
> 



