Patch "mm: page_alloc: control latency caused by zone PCP draining" has been added to the 6.1-stable tree

This is a note to let you know that I've just added the patch titled

    mm: page_alloc: control latency caused by zone PCP draining

to the 6.1-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     mm-page_alloc-control-latency-caused-by-zone-pcp-dra.patch
and it can be found in the queue-6.1 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 458b05c581cf671aa8ed527858c142ba78ebf8c0
Author: Lucas Stach <l.stach@xxxxxxxxxxxxxx>
Date:   Mon Mar 18 21:07:36 2024 +0100

    mm: page_alloc: control latency caused by zone PCP draining
    
    [ Upstream commit 55f77df7d715110299f12c27f4365bd6332d1adb ]
    
    Patch series "mm/treewide: Remove pXd_huge() API", v2.
    
    In previous work [1], we removed the pXd_large() API, which is
    arch-specific.  This patchset further removes the hugetlb pXd_huge() API.
    
    Hugetlb has never been special in how it creates huge mappings when
    compared with other users of huge mappings.  Having a standalone API just
    to detect such pgtable entries is more or less redundant, especially now
    that the pXd_leaf() API set is defined with or without
    CONFIG_HUGETLB_PAGE.
    
    While looking at this problem, a few issues were also exposed: we don't
    have a clear definition of the *_huge() variant of the API.  This patchset
    starts by cleaning up these issues, then replaces all *_huge() users with
    *_leaf(), and finally drops all the *_huge() code.
    
    On x86 and sparc, swap entries are reported as "true" by pXd_huge(), while
    on all other architectures they are reported as "false".  This part is
    done in patches 1-5; I suspect patch 1 can be seen as a bug fix, but I'll
    leave that to the hmm experts to decide.
    
    Besides, there are three architectures (arm, arm64, powerpc) that have
    slightly different definitions between the *_huge() and *_leaf() variants.
    I tackled them separately so that it will be easier for the arch experts
    to chime in when necessary.  This part is done in patches 6-9.
    
    The final patches 10-14 do the rest of the removal.  Since *_leaf() will
    be the ultimate API going forward, and there has been quite some confusion
    about how the *_huge() APIs should be defined, they also add a rich
    comment for the *_leaf() API set to define it properly and avoid future
    misuse, which will hopefully also help new architectures start supporting
    huge mappings while avoiding traps (such as swap entries or PROT_NONE
    entry checks).
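    
    As a rough sketch of the direction (illustrative only, not code from
    this series), a generic page table walker that used to special-case
    huge entries via pmd_huge() can test pmd_leaf() instead, which is
    defined with or without CONFIG_HUGETLB_PAGE; the helpers
    handle_leaf_entry() and walk_pte_range() below are hypothetical:
    
        static int walk_pmd(pmd_t *pmd, unsigned long addr)
        {
        	/*
        	 * pmd_leaf() is true when this PMD directly maps a huge
        	 * page, whether it came from hugetlb or THP.
        	 */
        	if (pmd_leaf(*pmd))
        		return handle_leaf_entry(pmd, addr);	/* hypothetical */
        
        	/* Otherwise descend to the PTE level as usual. */
        	return walk_pte_range(pmd, addr);		/* hypothetical */
        }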
    
    [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@xxxxxxxxxx
    
    This patch (of 14):
    
    When the complete PCP is drained, a much larger number of pages than the
    usual batch size might be freed at once, causing large IRQ and preemption
    latency spikes, as they are all freed while holding the pcp and zone
    spinlocks.
    
    To avoid those latency spikes, limit the number of pages freed in a single
    bulk operation to common batch limits.
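    
    To put rough numbers on the cap (illustrative values, not taken from
    this patch): with a typical pcp->batch of 63 and the default
    CONFIG_PCP_BATCH_SCALE_MAX of 5, each bulk free is limited to
    63 << 5 = 2016 pages, so a PCP holding 10000 pages is drained in five
    short lock/unlock rounds rather than one long one.  A minimal standalone
    sketch of just the chunking arithmetic:
    
        #include <stdio.h>
        
        int main(void)
        {
        	int batch = 63;		/* assumed typical pcp->batch */
        	int scale_max = 5;	/* assumed CONFIG_PCP_BATCH_SCALE_MAX */
        	int count = 10000;	/* pages currently on the PCP list */
        
        	while (count) {
        		int cap = batch << scale_max;	/* 2016 pages/round */
        		int to_drain = count < cap ? count : cap;
        
        		count -= to_drain;
        		/* in the kernel: lock, bulk free, unlock per round */
        		printf("freed %d pages, %d remaining\n", to_drain, count);
        	}
        	return 0;
        }
    
    In the actual patch below, the pcp spinlock is dropped and retaken
    around each chunk, which is what bounds the lock hold time and lets
    pending IRQs and preemption in between rounds.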
    
    Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@xxxxxxxxxx
    Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@xxxxxxxxxxxxxx
    Signed-off-by: Lucas Stach <l.stach@xxxxxxxxxxxxxx>
    Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
    Cc: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
    Cc: Jason Gunthorpe <jgg@xxxxxxxxxx>
    Cc: "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx>
    Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>
    Cc: Muchun Song <muchun.song@xxxxxxxxx>
    Cc: Alistair Popple <apopple@xxxxxxxxxx>
    Cc: Andreas Larsson <andreas@xxxxxxxxxxx>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxx>
    Cc: Arnd Bergmann <arnd@xxxxxxxx>
    Cc: Bjorn Andersson <andersson@xxxxxxxxxx>
    Cc: Borislav Petkov <bp@xxxxxxxxx>
    Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
    Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
    Cc: David S. Miller <davem@xxxxxxxxxxxxx>
    Cc: Fabio Estevam <festevam@xxxxxxx>
    Cc: Ingo Molnar <mingo@xxxxxxxxxx>
    Cc: Konrad Dybcio <konrad.dybcio@xxxxxxxxxx>
    Cc: Krzysztof Kozlowski <krzysztof.kozlowski@xxxxxxxxxx>
    Cc: Mark Salter <msalter@xxxxxxxxxx>
    Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
    Cc: Naoya Horiguchi <nao.horiguchi@xxxxxxxxx>
    Cc: "Naveen N. Rao" <naveen.n.rao@xxxxxxxxxxxxx>
    Cc: Nicholas Piggin <npiggin@xxxxxxxxx>
    Cc: Russell King <linux@xxxxxxxxxxxxxxx>
    Cc: Shawn Guo <shawnguo@xxxxxxxxxx>
    Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
    Cc: Will Deacon <will@xxxxxxxxxx>
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Stable-dep-of: 66eca1021a42 ("mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()")
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8eaf51257db5f..4029d13636ece 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3176,12 +3176,15 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
-	struct per_cpu_pages *pcp;
+	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+	int count = READ_ONCE(pcp->count);
+
+	while (count) {
+		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+		count -= to_drain;
 
-	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	if (pcp->count) {
 		spin_lock(&pcp->lock);
-		free_pcppages_bulk(zone, pcp->count, pcp, 0);
+		free_pcppages_bulk(zone, to_drain, pcp, 0);
 		spin_unlock(&pcp->lock);
 	}
 }



