Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU

Hello everyone,

On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> >
> > >
> > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > Hi Yu,
> > > > > >
> > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > additional feedback? Thanks.
> > > > > >
> > > > > > First, thanks for taking this patch upstream.
> > > > > >
> > > > > > A comment on the code snippet: checking just the 'high wmark' pages
> > > > > > might succeed here but can still fail in the immediate kswapd sleep,
> > > > > > see prepare_kswapd_sleep(). This can show up as an increased
> > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY count, and thus unnecessary kswapd run
> > > > > > time. @Jaroslav: have you observed anything like the above?
> > > > >
> > > > > I do not see any unnecessary kswapd run time; on the contrary, it
> > > > > fixes the continuous kswapd run issue.
> > > > >
> > > > > >
> > > > > > So, downstream, we pass something like this to zone_watermark_ok():
> > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > >
> > > > > > It is hard to justify the empirical 'MIN_LRU_BATCH << 2' value; maybe
> > > > > > we should at least use 'MIN_LRU_BATCH', with the reasoning mentioned
> > > > > > above. That is all I can say about this patch.
> > > > > >
> > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > +
> > > > > > +               if (managed_zone(zone) &&
> > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > +                       return false;
> > > > > > +       }
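
(For comparison, a minimal sketch of the alternative Charan suggests above:
the same loop, but with MIN_LRU_BATCH of headroom added on top of the
watermark. This is only an illustration, not the code that was merged.)

	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
	       WMARK_PROMO : WMARK_HIGH;
	for (i = 0; i <= sc->reclaim_idx; i++) {
		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
		/* headroom so the check is not satisfied by a margin that
		 * prepare_kswapd_sleep() would immediately reject */
		unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;

		if (managed_zone(zone) &&
		    !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
			return false;
	}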
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Charan
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jaroslav Pulchart
> > > > > Sr. Principal SW Engineer
> > > > > GoodData
> > > >
> > > >
> > > > Hello,
> > > >
> > > > today we tried to update the servers to 6.6.9, which contains the mglru
> > > > fixes (from 6.6.8), and the servers behave much, much worse.
> > > >
> > > > Multiple kswapd* threads immediately went to ~100% load:
> > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > >
> > > > Are the changes in upstream different compared to the initial patch
> > > > which I tested?
> > > >
> > > > Best regards,
> > > > Jaroslav Pulchart
> > >
> > > Hi Jaroslav,
> > >
> > > My apologies for all the trouble!
> > >
> > > Yes, there is a slight difference between the fix you verified and
> > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > condition which I thought wouldn't affect you.
> > >
> > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > special condition.
> > >
> > > Thanks!
> >
> > Thanks for the prompt response. I tested the patch and it didn't help.
> > The situation is very strange.
> >
> > I tried kernels 6.6.7, 6.6.8 and 6.6.9. With 6.6.9 I see high memory
> > utilization on all NUMA nodes of the first CPU socket, which is the
> > worst case, but the kswapd load is already visible with 6.6.8.
> >
> > Setup of this server:
> > * 2 sockets, 4 chiplets per socket (8 NUMA nodes)
> > * 32 GB of RAM per chiplet, 28 GB of which are in hugepages
> >   Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid
> > memory pressure, but on the contrary it is now even worse.
> >
> > kernel 6.6.7: I do not see kswapd usage when the application starts == OK
> > NUMA nodes:   0      1      2      3      4      5      6      7
> > HPTotalGiB:   28     28     28     28     28     28     28     28
> > HPFreeGiB:    28     28     28     28     28     28     28     28
> > MemTotal:     32264  32701  32701  32686  32701  32659  32701  32696
> > MemFree:      2766   2715   63     2366   3495   2990   3462   252
> >
> > kernel 6.6.8: I see kswapd on nodes 2 and 3 when the application starts
> > NUMA nodes:   0      1      2      3      4      5      6      7
> > HPTotalGiB:   28     28     28     28     28     28     28     28
> > HPFreeGiB:    28     28     28     28     28     28     28     28
> > MemTotal:     32264  32701  32701  32686  32701  32701  32659  32696
> > MemFree:      2744   2788   65     581    3304   3215   3266   2226
> >
> > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when the application starts
> > NUMA nodes:   0      1      2      3      4      5      6      7
> > HPTotalGiB:   28     28     28     28     28     28     28     28
> > HPFreeGiB:    28     28     28     28     28     28     28     28
> > MemTotal:     32264  32701  32701  32686  32659  32701  32701  32696
> > MemFree:      75     60     60     60     3169   2784   3203   2944
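
(A quick back-of-the-envelope check of the headroom these tables imply,
assuming the 28 GiB per node are reserved as hugepages:

    per node:   ~32 GiB total - 28 GiB in hugepages ~= 4 GiB for kernel + userspace
    whole box:  8 nodes x 28 GiB = 224 GiB reserved out of ~256 GiB total

So a node whose MemFree drops to a few tens of MiB has essentially exhausted
its ~4 GiB of non-hugepage memory, which matches kswapd running continuously
on exactly those nodes.)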
>
> I ran a few more combinations, and here are the results / findings:
>
>   6.6.7-1  (vanilla)                             == OK, no issue
>
>   6.6.8-1  (vanilla)                             == single kswapd at 100% !
>   6.6.8-1  (vanilla plus mglru-fix-6.6.9.patch)  == OK, no issue
>   6.6.8-1  (revert four mglru patches)           == OK, no issue
>
>   6.6.9-1  (vanilla)                             == four kswapd at 100% !!!!
>   6.6.9-2  (vanilla plus mglru-fix-6.6.9.patch)  == four kswapd at 100% !!!!
>   6.6.9-3  (revert four mglru patches)           == four kswapd at 100% !!!!
>
> Summary:
> * mglru-fix-6.6.9.patch or reverting the mglru patches helps in the case
> of kernel 6.6.8,
> * there is a (new?) problem with the 6.6.9 kernel, which appears not to
> be related to the mglru patches at all

I was able to bisect this, and it looks like something is going wrong
with the ice driver…

Usually, after booting our server, we see something like the following:
most of the nodes have ~2-3 GB of free memory, but there are always 1-2
NUMA nodes with a really low amount of free memory. We don't know why,
but it looks like that is what ultimately causes the constant swap
in/out issue. With the final bit of the patch you sent earlier in this
thread it is almost invisible.

NUMA nodes:     0       1       2       3       4       5       6       7
HPTotalGiB:     28      28      28      28      28      28      28      28
HPFreeGiB:      28      28      28      28      28      28      28      28
MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
MemFree:        2191    2828    92      292     3344    2916    3594    3222


However, with the following patch applied, we see that more NUMA nodes
have such a low amount of free memory, and that causes constant memory
reclaim; it looks like something inside the kernel ate all the memory.
This is right after system start as well.

NUMA nodes:     0       1       2       3       4       5       6       7
HPTotalGiB:     28      28      28      28      28      28      28      28
HPFreeGiB:      28      28      28      28      28      28      28      28
MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
MemFree:        46      59      51      33      3078    3535    2708    3511

The difference is 18 GB vs 12 GB of free memory summed across all NUMA
nodes right after boot of the system. If you have any hints on how to
debug what is actually occupying all that memory, perhaps in both cases,
I would be happy to debug further!
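
If it helps, here is a minimal sketch of a helper that dumps a few of the
per-node counters from /sys/devices/system/node/nodeN/meminfo (the choice
of fields is only an assumption about where the missing memory is likely
to show up, e.g. slab, kernel stacks or page tables):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* counters that are likely candidates for "invisible" kernel memory */
	const char *want[] = { "MemFree:", "Slab:", "SUnreclaim:",
			       "KernelStack:", "PageTables:", "HugePages_Free:" };
	char path[64], line[256];

	for (int node = 0; node < 8; node++) {	/* 8 NUMA nodes on this box */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", node);
		FILE *f = fopen(path, "r");

		if (!f)
			break;
		while (fgets(line, sizeof(line), f)) {
			for (size_t i = 0; i < sizeof(want) / sizeof(want[0]); i++)
				if (strstr(line, want[i]))
					fputs(line, stdout);
		}
		fclose(f);
	}
	return 0;
}

Comparing these counters between a "good" node (~3 GB free) and a "bad"
one (tens of MB free) right after boot should at least narrow down whether
the memory sits in slab, page tables, or somewhere else.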

Dave, would you have any idea why that patch could cause such an
increase in memory utilization?

commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
Author: Dave Ertman <david.m.ertman@xxxxxxxxx>
Date:   Mon Dec 11 13:19:28 2023 -0800

    ice: alter feature support check for SRIOV and LAG

    [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]

    Previously, the ice driver had support for using a handler for bonding
    netdev events to ensure that conflicting features were not allowed to be
    activated at the same time.  While this was still in place, additional
    support was added to specifically support SRIOV and LAG together.  These
    both utilized the netdev event handler, but the SRIOV and LAG feature was
    behind a capabilities feature check to make sure the current NVM has
    support.

    The exclusion part of the event handler should be removed since there are
    users who have custom made solutions that depend on the non-exclusion of
    features.

    Wrap the creation/registration and cleanup of the event handler and
    associated structs in the probe flow with a feature check so that the
    only systems that support the full implementation of LAG features will
    initialize support.  This will leave other systems unhindered with
    functionality as it existed before any LAG code was added.
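
(Schematically, and with placeholder names rather than the real ice driver
API, the change the commit describes is to gate the registration and
cleanup of the bonding-event handler behind the NVM feature check instead
of doing it unconditionally in probe. A self-contained toy illustration:)

#include <stdbool.h>
#include <stdio.h>

struct toy_pf {
	bool nvm_sriov_lag_cap;		/* stands in for the NVM capability bit */
	bool lag_handler_registered;
};

/* placeholder for the driver's "does this NVM support SRIOV+LAG?" check */
static bool toy_feature_supported(const struct toy_pf *pf)
{
	return pf->nvm_sriov_lag_cap;
}

static void toy_probe(struct toy_pf *pf)
{
	/* after the commit: only register the netdev event handler on
	 * systems with full SRIOV+LAG support; other systems keep the
	 * behaviour from before any LAG code existed */
	if (toy_feature_supported(pf))
		pf->lag_handler_registered = true;
}

static void toy_remove(struct toy_pf *pf)
{
	/* cleanup is gated by the same check */
	if (toy_feature_supported(pf))
		pf->lag_handler_registered = false;
}

int main(void)
{
	struct toy_pf with_cap = { .nvm_sriov_lag_cap = true };
	struct toy_pf without_cap = { .nvm_sriov_lag_cap = false };

	toy_probe(&with_cap);
	toy_probe(&without_cap);
	printf("handler registered: with cap = %d, without cap = %d\n",
	       with_cap.lag_handler_registered,
	       without_cap.lag_handler_registered);
	toy_remove(&with_cap);
	toy_remove(&without_cap);
	return 0;
}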




