On 4/27/20 9:01 AM, Sven Eckelmann wrote:
> On Monday, 27 April 2020 01:14:26 CEST Sasha Levin wrote:
>> On Sun, Apr 26, 2020 at 09:06:17AM +0200, Sven Eckelmann wrote:
>> >From: Vlastimil Babka <vbabka@xxxxxxx>
>> >
>> >commit 0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5 upstream.
> [...]
>> >---
>> >The original problem is explained in the patch description as
>> >performance problem. And maybe this could also be one reason why it was
>> >never submitted for a stable kernel.
>> >
>> >But tests on mips ath79 (OpenWrt ar71xx target) showed that it most likely
>> >related to "random" data bus errors. At least applying this patch seemed to
>> >have solved it for Matthias Schiffer <mschiffer@xxxxxxxxxxxxxxxxxxxx> and
>> >some other persons who where debugging/testing this problem with him.
>> >
>> >More details about it can be found in
>> >https://github.com/freifunk-gluon/gluon/issues/1982

Hmm, that doesn't explain much about how the fix was eventually found, but
never mind, good job.

>>
>> Interesting... I wonder why this issue has started only now.
>
> Unfortunately, I don't know the details. So I (actually we) would love to get
> some feedback from the slub experts. Not that there is another problem which
> we just don't grasp yet.

I think the prefetch may go to an address that would cause a real fetch to
page fault. Under normal circumstances that could only be the NULL pointer
that terminates a freelist; otherwise the address should be valid. So that
could mean:

1) prefetch() on mips is implemented/compiled wrong?
2) the CPU really has issues with prefetch causing a page fault
3) the prefetch gets reordered between LL/SC and there's some bug similar to
   this one described in arch/mips/include/asm/sync.h:

/*
 * Some Loongson 3 CPUs have a bug wherein execution of a memory access (load,
 * store or prefetch) in between an LL & SC can cause the SC instruction to
 * erroneously succeed, breaking atomicity. Whilst it's unusual to write code
 * containing such sequences, this bug bites harder than we might otherwise
 * expect due to reordering & speculation:

> Just some background information about the "why" from freifunk-gluon's
> perspective:
>
> OpenWrt 19.07 was released (despite its name) at the beginning of 2020. And it
> was the first release using kernel 4.14 on the most used target: ar71xx
> (ath79). The wireless community network firmware projects (freifunk-gluon in
> this example) updated their frameworks to this OpenWrt release in the last
> months and just now started to roll it out on their networks.
>
> And while the wireless community networks around here usually don't track the
> connected clients, the health of the APs is often tracked on some central
> system. And some people then just noticed a sudden spike of reboots on their
> APs. Since ar71xx is (often) the most used architecture at the moment, this
> could be spotted rather easily if you spend some time looking at graphs.
>
> Kind regards,
> Sven
>
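
To make the freelist-termination case concrete, here is a minimal userspace
sketch. It is not the SLUB code: struct object, freelist_pop() and
demo_drain() are made-up names for illustration, and it assumes GCC or Clang
for __builtin_prefetch(). It only shows that popping the last object issues a
prefetch hint on the NULL terminator; it does not attempt to reproduce the
ath79 bus errors.

```c
#include <stddef.h>

/* Hypothetical stand-in for a slab object; not the kernel's layout. */
struct object {
	struct object *next_free;	/* NULL marks the end of the freelist */
	char payload[56];
};

/* Number of prefetch hints issued on a NULL "next" pointer so far. */
static unsigned int null_prefetches;

/*
 * Pop the head of the freelist and hint the cache about the object that
 * would be handed out next. __builtin_prefetch() is only a hint and is
 * documented by GCC not to fault when the address is invalid.
 */
static struct object *freelist_pop(struct object **head)
{
	struct object *obj = *head;

	if (!obj)
		return NULL;

	*head = obj->next_free;
	if (!*head)
		null_prefetches++;	/* the terminating NULL gets prefetched */
	__builtin_prefetch(*head);

	return obj;
}

/* Build a three-object freelist, drain it, and count the NULL prefetches. */
static unsigned int demo_drain(void)
{
	struct object objs[3];
	struct object *head = &objs[0];

	objs[0].next_free = &objs[1];
	objs[1].next_free = &objs[2];
	objs[2].next_free = NULL;

	null_prefetches = 0;
	while (freelist_pop(&head))
		;

	return null_prefetches;
}
```

Calling demo_drain() counts exactly one prefetch of NULL for the three-object
list, i.e. the single address that the three hypotheses above are concerned
with.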