RE: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU

"Ertman, David M" <david.m.ertman@xxxxxxxxx> · Fri, 5 Jan 2024 17:35:07 +0000

> -----Original Message-----
> From: Igor Raits <igor@xxxxxxxxxxxx>
> Sent: Thursday, January 4, 2024 3:51 PM
> To: Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx>
> Cc: Yu Zhao <yuzhao@xxxxxxxxxx>; Daniel Secik
> <daniel.secik@xxxxxxxxxxxx>; Charan Teja Kalla
> <quic_charante@xxxxxxxxxxx>; Kalesh Singh <kaleshsingh@xxxxxxxxxx>;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; Ertman, David M
> <david.m.ertman@xxxxxxxxx>
> Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> with multi-gen LRU
> 
> Hello everyone,
> 
> On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> >
> > >
> > > >
> > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Hi yu,
> > > > > > >
> > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > > > additional feedback? Thanks.
> > > > > > >
> > > > > > > First, thanks for taking this patch to upstream.
> > > > > > >
> > > > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > > @Jaroslav: Have you observed something like above?
> > > > > >
> > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > fixing the kswapd continuous run issue.
> > > > > >
> > > > > > >
> > > > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > >
> > > > > > > > It is hard to justify the empirical 'MIN_LRU_BATCH << 2' value; maybe
> > > > > > > > we should at least use 'MIN_LRU_BATCH' with the reasoning mentioned
> > > > > > > > above. That is all I can say about this patch.
> > > > > > >
> > > > > > > +       mark = sysctl_numa_balancing_mode &
> NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> i;
> > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > +
> > > > > > > +               if (managed_zone(zone) &&
> > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> >reclaim_idx, 0))
> > > > > > > +                       return false;
> > > > > > > +       }
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Charan
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jaroslav Pulchart
> > > > > > Sr. Principal SW Engineer
> > > > > > GoodData
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > > today we tried to update servers to 6.6.9, which contains the mglru fixes
> > > > > > (from 6.6.8), and the server behaves much, much worse.
> > > > > >
> > > > > > Multiple kswapd* threads went to ~100% CPU immediately:
> > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > > > Are the changes in upstream different from the initial patch
> > > > > > which I tested?
> > > > >
> > > > > Best regards,
> > > > > Jaroslav Pulchart
> > > >
> > > > Hi Jaroslav,
> > > >
> > > > My apologies for all the trouble!
> > > >
> > > > Yes, there is a slight difference between the fix you verified and
> > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > condition which I thought wouldn't affect you.
> > > >
> > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > special condition.
> > > >
> > > > Thanks!
> > >
> > > > Thanks for the prompt response. I did a test with the patch and it didn't
> > > > help. The situation is very strange.
> > > >
> > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. With 6.6.9 I see high memory
> > > > utilization on all NUMA nodes of the first CPU socket, which is the
> > > > worst situation, but the kswapd load is already visible with 6.6.8.
> > >
> > > Setup of this server:
> > > > * 2 sockets with 4 chiplets each
> > > > * 32 GB of RAM per chiplet, 28 GB of it in hugepages
> > > >   Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid
> > > > memory pressure, but on the contrary it is even worse now.
> > >
> > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > NUMA nodes:  0     1     2     3     4     5     6     7
> > > > HPTotalGiB:  28    28    28    28    28    28    28    28
> > > > HPFreeGiB:   28    28    28    28    28    28    28    28
> > > > MemTotal:    32264 32701 32701 32686 32701 32659 32701 32696
> > > > MemFree:     2766  2715  63    2366  3495  2990  3462  252
> > >
> > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > NUMA nodes:  0     1     2     3     4     5     6     7
> > > > HPTotalGiB:  28    28    28    28    28    28    28    28
> > > > HPFreeGiB:   28    28    28    28    28    28    28    28
> > > > MemTotal:    32264 32701 32701 32686 32701 32701 32659 32696
> > > > MemFree:     2744  2788  65    581   3304  3215  3266  2226
> > >
> > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > NUMA nodes:  0     1     2     3     4     5     6     7
> > > > HPTotalGiB:  28    28    28    28    28    28    28    28
> > > > HPFreeGiB:   28    28    28    28    28    28    28    28
> > > > MemTotal:    32264 32701 32701 32686 32659 32701 32701 32696
> > > > MemFree:     75    60    60    60    3169  2784  3203  2944
> >
> > > I ran a few more combinations, and here are the results / findings:
> > >
> > >   6.6.7-1  (vanilla)                            == OK, no issue
> > >
> > >   6.6.8-1  (vanilla)                            == single kswapd 100% !
> > >   6.6.8-1  (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
> > >   6.6.8-1  (revert four mglru patches)          == OK, no issue
> > >
> > >   6.6.9-1  (vanilla)                            == four kswapd 100% !!!!
> > >   6.6.9-2  (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > >   6.6.9-3  (revert four mglru patches)          == four kswapd 100% !!!!
> >
> > Summary:
> > > * mglru-fix-6.6.9.patch or reverting the mglru patches helps in the case
> > > of kernel 6.6.8,
> > > * there is a (new?) problem in the case of the 6.6.9 kernel, which does
> > > not look related to the mglru patches at all
> 
> I was able to bisect this change and it looks like there is something
> going wrong with the ice driver…
> 
> Usually after booting our server we see something like this. Most of
> the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> that have a really low amount of free memory and we don't know why but
> it looks like that in the end causes the constant swap in/out issue.
> With the final bit of the patch you've sent earlier in this thread it
> is almost invisible.
> 
> NUMA nodes:     0       1       2       3       4       5       6       7
> HPTotalGiB:     28      28      28      28      28      28      28      28
> HPFreeGiB:      28      28      28      28      28      28      28      28
> MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> MemFree:        2191    2828    92      292     3344    2916    3594    3222
> 
> 
> However, after the following patch we see that more NUMA nodes have
> such a low amount of free memory, and that is causing constant reclaiming
> of memory because it looks like something inside of the kernel ate all
> the memory. This is right after the start of the system as well.
> 
> NUMA nodes:     0       1       2       3       4       5       6       7
> HPTotalGiB:     28      28      28      28      28      28      28      28
> HPFreeGiB:      28      28      28      28      28      28      28      28
> MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> MemFree:        46      59      51      33      3078    3535    2708    3511
> 
> The difference is 18G vs 12G of free memory sum'd across all NUMA
> nodes right after boot of the system. If you have some hints on how to
> debug what is actually occupying all that memory, maybe in both cases
> - would be happy to debug more!
> 
> Dave, would you have any idea why that patch could cause such a boost
> in memory utilization?
> 
> commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> Author: Dave Ertman <david.m.ertman@xxxxxxxxx>
> Date:   Mon Dec 11 13:19:28 2023 -0800
> 
>     ice: alter feature support check for SRIOV and LAG
> 
>     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> 
>     Previously, the ice driver had support for using a handler for bonding
>     netdev events to ensure that conflicting features were not allowed to be
>     activated at the same time.  While this was still in place, additional
>     support was added to specifically support SRIOV and LAG together.  These
>     both utilized the netdev event handler, but the SRIOV and LAG feature was
>     behind a capabilities feature check to make sure the current NVM has
>     support.
> 
>     The exclusion part of the event handler should be removed since there are
>     users who have custom made solutions that depend on the non-exclusion of
>     features.
> 
>     Wrap the creation/registration and cleanup of the event handler and
>     associated structs in the probe flow with a feature check so that the
>     only systems that support the full implementation of LAG features will
>     initialize support.  This will leave other systems unhindered with
>     functionality as it existed before any LAG code was added.

Igor,

I have no idea why that two-line commit would do anything to increase memory usage by the ice driver.
If anything, I would expect it to lower memory usage, as it has the potential to stop the allocation of memory
for the pf->lag struct.
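
On the question of how to see what is occupying the memory on each node, a
rough sketch along the following lines might help when comparing a boot with
and without the commit. It is only an illustration and has not been run on the
affected servers. It assumes the per-node files under
/sys/devices/system/node/node*/meminfo and the field names they normally
expose (MemTotal, MemFree, Slab, SUnreclaim, HugePages_Total, HugePages_Free);
the function names and the choice of fields are made up for this example.

```python
import glob
import re

# Fields to compare between a good and a bad boot (a choice made for this sketch).
FIELDS = ("MemTotal", "MemFree", "Slab", "SUnreclaim",
          "HugePages_Total", "HugePages_Free")

# Matches lines such as "Node 0 MemFree:   76800 kB" or "Node 0 HugePages_Total:  28".
LINE_RE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)(?:\s+kB)?\s*$")


def parse_node_meminfo(text):
    """Return {field: value} for one node; values are kB, or page counts for HugePages_*."""
    values = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            values[m.group(2)] = int(m.group(3))
    return values


def read_all_nodes(pattern="/sys/devices/system/node/node[0-9]*/meminfo"):
    """Read every node's meminfo file that matches the (assumed) sysfs path pattern."""
    nodes = {}
    for path in glob.glob(pattern):
        node_id = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            nodes[node_id] = parse_node_meminfo(f.read())
    return nodes


def main():
    nodes = read_all_nodes()
    print("node " + " ".join(f"{name:>16}" for name in FIELDS))
    for node_id in sorted(nodes):
        row = nodes[node_id]
        print(f"{node_id:>4} " + " ".join(f"{row.get(name, 0):>16}" for name in FIELDS))
    # Sum of free memory over all nodes, for a quick before/after comparison.
    total_free_kb = sum(row.get("MemFree", 0) for row in nodes.values())
    print(f"MemFree summed over all nodes: {total_free_kb / 1024:.0f} MiB")


if __name__ == "__main__":
    main()
```

Running it once on each kernel right after boot and diffing the output would
show whether the missing memory on nodes 0-3 shows up as slab (Slab,
SUnreclaim) or is not accounted for in these fields at all.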

DaveE