Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

On Sat, 7 Sep 2019, Linus Torvalds wrote:

> > Andrea acknowledges the swap storm that he reported would be fixed with
> > the last two patches in this series
> 
> The problem is that even you aren't arguing that those patches should
> go into 5.3.
> 

For three reasons: (a) we lack a test result from Andrea, (b) there is 
ongoing discussion, particularly based on Vlastimil's feedback, and 
(c) the patches will be refreshed to incorporate that feedback as well as 
Mike's suggestion to exempt __GFP_RETRY_MAYFAIL allocations for hugetlb.
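
For context, the idea behind that exemption (a rough sketch of my own, not 
the refreshed patch itself) is that a costly-order allocation would only 
skip the expensive local reclaim/compaction retry when the caller has not 
passed __GFP_RETRY_MAYFAIL, so hugetlb pool allocations, which do pass it, 
keep their current behavior while THP faults bail out early:

        /*
         * Sketch only, not the actual patch: callers such as hugetlb
         * that pass __GFP_RETRY_MAYFAIL keep retrying reclaim and
         * compaction locally; costly-order THP faults do not.
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER &&
            !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
                /* don't thrash reclaim for a costly order; give up locally */
                goto nopage;
        }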

> So those fixes aren't going in, so "the swap storms would be fixed"
> argument isn't actually an argument at all as far as 5.3 is concerned.
> 

It indicates that progress has been made to address the actual bug without 
introducing long-lived access latency regressions for others, particularly 
those who use MADV_HUGEPAGE.  In the worst case, some systems running 
5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages, 
but on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs 
a significant performance regression regardless of platform; the only 
thing needed to induce it is a fragmented local node that would have been 
compacted on 5.3-rc4 but is quickly skipped in favor of remote allocation 
on 5.3-rc5.
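
To make the regression concrete: the affected workloads opt in via 
madvise(MADV_HUGEPAGE), and whether the resulting hugepages ended up local 
or remote is visible in /proc/<pid>/smaps (AnonHugePages) and 
/proc/<pid>/numa_maps (the per-node N<node>= counts).  A minimal 
illustration of my own, with an arbitrary 512MB mapping standing in for a 
real workload:

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 512UL << 20;       /* arbitrary 512MB region */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                /* Opt in to transparent hugepages for this range. */
                if (madvise(p, len, MADV_HUGEPAGE))
                        perror("madvise");

                /* Fault the range in so hugepage allocation happens. */
                memset(p, 1, len);

                /*
                 * While this sleeps, check placement from another shell:
                 *   grep AnonHugePages /proc/<pid>/smaps
                 *   cat /proc/<pid>/numa_maps  (N<node>= counts for the
                 *   mapping's address show local vs remote placement)
                 */
                printf("mapping at %p, pid %d\n", (void *)p, (int)getpid());
                pause();
                return 0;
        }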

> End result: we'd have the qemu-kvm instance performance problem in 5.3
> that apparently causes distros to apply those patches that you want to
> revert anyway.
> 
> So reverting would just make distros not use 5.3 in that form.
> 

I'm arguing to revert 5.3 to the behavior we have had for years, to 
actually fix the bug that everybody else seems to be ignoring, and then to 
*backport* those fixes to 5.3 stable and every other stable tree that can 
use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that 
will subsequently be changed in later 5.3 stable kernels, and that differs 
from every kernel of the past few years, is not in anybody's best interest 
if the actual problem can be fixed.  That requires more feedback than a 
one-line "the swap storms would be fixed with this."  Such collaboration 
takes time and isn't something that should be rushed into 5.3-rc5.

Yes, we can fix NUMA locality of hugepages when a workload like qemu is 
larger than a single socket; but the vast majority of workloads in the 
datacenter are smaller than a socket and *cannot* afford the performance 
penalty that 5.3-rc5 introduces when local memory is fragmented.

In other words, 5.3-rc5 only fixes a highly specialized usecase where 
remote allocation is acceptable because the workload is larger than a 
socket *and* the remote node is not itself low on memory or fragmented.  
If you consider the opposite, workloads smaller than a socket or cases 
where local compaction actually works, this has introduced a measurable 
regression for everybody else.

I'm not sure why we are ignoring a painfully obvious bug in the page 
allocator, a poor feedback loop between it and memory compaction, and 
instead papering over it by falling back to remote memory when NUMA 
actually does matter.  If you release 5.3 without the first two patches in 
this series, I wouldn't expect any additional feedback or test results to 
fix this bug, considering all we have gotten so far is "this would fix the 
swap storms" rather than collaboration to fix the issue for everybody 
instead of only one party's workloads.  At least my patches acknowledge 
and try to fix the issue the other side is encountering.



