> On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> >
> > > >
> > > > Hi Yu,
> > > >
> > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > additional feedback? Thanks.
> > > >
> > > > First, thanks for taking this patch to upstream.
> > > >
> > > > A comment on the code snippet: checking just the 'high wmark' pages
> > > > might succeed here but can still fail in the immediately following
> > > > kswapd sleep, see prepare_kswapd_sleep(). This can show up as an
> > > > increased KSWAPD_HIGH_WMARK_HIT_QUICKLY count, and thus unnecessary
> > > > kswapd run time. @Jaroslav: Have you observed something like the above?
> > >
> > > I do not see any unnecessary kswapd run time; on the contrary, it is
> > > fixing the kswapd continuous-run issue.
> > >
> > > > So, in downstream, we have something like this for zone_watermark_ok():
> > > >     unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > >
> > > > It is hard to justify this 'MIN_LRU_BATCH << 2' empirical value; maybe
> > > > we should at least use 'MIN_LRU_BATCH' with the reasoning mentioned
> > > > above, is all I can say for this patch.
> > > >
> > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > +
> > > > +               if (managed_zone(zone) &&
> > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > +                       return false;
> > > > +       }
> > > >
> > > > Thanks,
> > > > Charan
> > >
> > > --
> > > Jaroslav Pulchart
> > > Sr. Principal SW Engineer
> > > GoodData
> >
> > Hello,
> >
> > today we tried to update the servers to 6.6.9, which contains the mglru
> > fixes (from 6.6.8), and the server behaves much, much worse.
> >
> > I got multiple kswapd* threads at ~100% load immediately:
> >   555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> >   554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> >   556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > Are the changes in upstream different compared to the initial patch
> > which I tested?
> >
> > Best regards,
> > Jaroslav Pulchart
>
> Hi Jaroslav,
>
> My apologies for all the trouble!
>
> Yes, there is a slight difference between the fix you verified and
> what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> condition which I thought wouldn't affect you.
>
> Could you try the attached fix again on top of 6.6.9? It removes that
> special condition.
>
> Thanks!

Thanks for the prompt response. I did a test with the patch and it didn't
help. The situation is super strange. I tried kernels 6.6.7, 6.6.8 and
6.6.9. With 6.6.9 I see high memory utilization on all NUMA nodes of the
first CPU socket, which is the worst situation, but the kswapd load is
already visible from 6.6.8.

Setup of this server:
* 4 chiplets per socket, 2 sockets in total
* 32 GB of RAM for each chiplet, 28 GB of it in hugepages
Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid memory
pressure, however it is even worse now.
kernel 6.6.7: I do not see kswapd usage when application started == OK

  NUMA nodes:  0      1      2      3      4      5      6      7
  HPTotalGiB:  28     28     28     28     28     28     28     28
  HPFreeGiB:   28     28     28     28     28     28     28     28
  MemTotal:    32264  32701  32701  32686  32701  32659  32701  32696
  MemFree:     2766   2715   63     2366   3495   2990   3462   252

kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started

  NUMA nodes:  0      1      2      3      4      5      6      7
  HPTotalGiB:  28     28     28     28     28     28     28     28
  HPFreeGiB:   28     28     28     28     28     28     28     28
  MemTotal:    32264  32701  32701  32686  32701  32701  32659  32696
  MemFree:     2744   2788   65     581    3304   3215   3266   2226

kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started

  NUMA nodes:  0      1      2      3      4      5      6      7
  HPTotalGiB:  28     28     28     28     28     28     28     28
  HPFreeGiB:   28     28     28     28     28     28     28     28
  MemTotal:    32264  32701  32701  32686  32659  32701  32701  32696
  MemFree:     75     60     60     60     3169   2784   3203   2944
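
An illustrative aside on the watermark discussion quoted above: the sketch
below is a small, stand-alone C model (not the kernel code) of why stopping
reclaim exactly at the high watermark lets the plain check succeed while a
margin-based recheck, of the kind prepare_kswapd_sleep() effectively performs,
can still ask for more headroom. The struct and function names and the
MIN_LRU_BATCH value of 64 are assumptions for illustration only.

/*
 * Stand-alone model of the headroom idea discussed above -- this is NOT the
 * kernel code.  struct zone_model, watermark_ok(), can_stop_reclaim() and the
 * MIN_LRU_BATCH value are illustrative assumptions only.
 */
#include <stdbool.h>
#include <stdio.h>

#define MIN_LRU_BATCH 64UL      /* assumption: MGLRU batch size in pages */

struct zone_model {
        unsigned long free_pages;
        unsigned long high_wmark;
};

/* Simplified stand-in for zone_watermark_ok(): are we above 'mark'? */
static bool watermark_ok(const struct zone_model *z, unsigned long mark)
{
        return z->free_pages > mark;
}

/*
 * Only stop reclaim once the high watermark is exceeded by a batch-sized
 * margin, so an immediately following sleep-time recheck is less likely to
 * fail and wake kswapd right back up.
 */
static bool can_stop_reclaim(const struct zone_model *z)
{
        /* Parenthesized on purpose: in C, '+' binds tighter than '<<'. */
        unsigned long mark = z->high_wmark + (MIN_LRU_BATCH << 2);

        return watermark_ok(z, mark);
}

int main(void)
{
        /* A zone sitting barely above its high watermark. */
        struct zone_model z = { .free_pages = 10300, .high_wmark = 10240 };

        /* The plain check passes, so reclaim would stop here... */
        printf("plain high-wmark check:    %s\n",
               watermark_ok(&z, z.high_wmark) ? "stop reclaim" : "keep reclaiming");
        /* ...while the margin check keeps reclaiming a little longer. */
        printf("with MIN_LRU_BATCH margin: %s\n",
               can_stop_reclaim(&z) ? "stop reclaim" : "keep reclaiming");
        return 0;
}

Built with, for example, 'gcc -Wall model.c', this prints that the plain check
would already stop reclaim while the margin-based check would not, which is
the gap the 'MIN_LRU_BATCH' headroom suggestion is aimed at.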