On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> >
> > > -----Original Message-----
> > > From: Igor Raits <igor@xxxxxxxxxxxx>
> > > Sent: Thursday, January 4, 2024 3:51 PM
> > > To: Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx>
> > > Cc: Yu Zhao <yuzhao@xxxxxxxxxx>; Daniel Secik
> > > <daniel.secik@xxxxxxxxxxxx>; Charan Teja Kalla
> > > <quic_charante@xxxxxxxxxxx>; Kalesh Singh <kaleshsingh@xxxxxxxxxx>;
> > > akpm@xxxxxxxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; Ertman, David M
> > > <david.m.ertman@xxxxxxxxx>
> > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > with multi-gen LRU
> > >
> > > Hello everyone,
> > >
> > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi yu,
> > > > > > > > >
> > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > > > > additional feedback? Thanks.
> > > > > > > > >
> > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > >
> > > > > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > >
> > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > > > >
> > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > > > > what all I can say for this patch.
> > > > > > > > >
> > > > > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > +
> > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > > > > +                       return false;
> > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Charan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jaroslav Pulchart
> > > > > > > > Sr. Principal SW Engineer
> > > > > > > > GoodData
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > >
> > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > which I tested?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jaroslav Pulchart
> > > > > >
> > > > > > Hi Jaroslav,
> > > > > >
> > > > > > My apologies for all the trouble!
> > > > > >
> > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > condition which I thought wouldn't affect you.
> > > > > >
> > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > special condition.
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > help. The situation is super strange.
> > > > >
> > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > >
> > > > > Setup of this server:
> > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > >   memory pressure however it is even worse now in contrary.
> > > > >
> > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > > > MemTotal:    32264 32701 32701 32686 32701 32659 32701 32696
> > > > > MemFree:      2766  2715    63  2366  3495  2990  3462   252
> > > > >
> > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > > > MemTotal:    32264 32701 32701 32686 32701 32701 32659 32696
> > > > > MemFree:      2744  2788    65   581  3304  3215  3266  2226
> > > > >
> > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > > > MemTotal:    32264 32701 32701 32686 32659 32701 32701 32696
> > > > > MemFree:        75    60    60    60  3169  2784  3203  2944
> > > >
> > > > I run few more combinations, and here are results / findings:
> > > >
> > > > 6.6.7-1 (vanila) == OK, no issue
> > > >
> > > > 6.6.8-1 (vanila) == single kswapd 100% !
> > > > 6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > 6.6.8-1 (revert four mglru patches) == OK, no issue
> > > >
> > > > 6.6.9-1 (vanila) == four kswapd 100% !!!!
> > > > 6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> > > >
> > > > Summary:
> > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > >   kernel 6.6.8,
> > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > >   be related to mglru patches at all
> > >
> > > I was able to bisect this change and it looks like there is something
> > > going wrong with the ice driver…
> > >
> > > Usually after booting our server we see something like this.
> > > Most of the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > that have a really low amount of free memory and we don't know why but
> > > it looks like that in the end causes the constant swap in/out issue.
> > > With the final bit of the patch you've sent earlier in this thread it
> > > is almost invisible.
> > >
> > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > MemTotal:    32264 32701 32659 32686 32701 32701 32701 32696
> > > MemFree:      2191  2828    92   292  3344  2916  3594  3222
> > >
> > >
> > > However, after the following patch we see that more NUMA nodes have
> > > such a low amount of memory and that is causing constant reclaiming
> > > of memory because it looks like something inside of the kernel ate all
> > > the memory. This is right after the start of the system as well.
> > >
> > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > MemTotal:    32264 32701 32659 32686 32701 32701 32701 32696
> > > MemFree:        46    59    51    33  3078  3535  2708  3511
> > >
> > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > nodes right after boot of the system. If you have some hints on how to
> > > debug what is actually occupying all that memory, maybe in both cases
> > > - would be happy to debug more!
> > >
> > > Dave, would you have any idea why that patch could cause such a boost
> > > in memory utilization?
> > >
> > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > Author: Dave Ertman <david.m.ertman@xxxxxxxxx>
> > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > >
> > >     ice: alter feature support check for SRIOV and LAG
> > >
> > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > >
> > >     Previously, the ice driver had support for using a handler for bonding
> > >     netdev events to ensure that conflicting features were not allowed to be
> > >     activated at the same time.
> > >     While this was still in place, additional
> > >     support was added to specifically support SRIOV and LAG together. These
> > >     both utilized the netdev event handler, but the SRIOV and LAG feature was
> > >     behind a capabilities feature check to make sure the current NVM has
> > >     support.
> > >
> > >     The exclusion part of the event handler should be removed since there are
> > >     users who have custom made solutions that depend on the non-exclusion of
> > >     features.
> > >
> > >     Wrap the creation/registration and cleanup of the event handler and
> > >     associated structs in the probe flow with a feature check so that the
> > >     only systems that support the full implementation of LAG features will
> > >     initialize support. This will leave other systems unhindered with
> > >     functionality as it existed before any LAG code was added.
> >
> > Igor,
> >
> > I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > for the pf->lag struct.
> >
> > DaveE
>
> Hello,
>
> I believe we can track it as two different issues. So I reported the
> ICE driver commit as a email with subject "[REGRESSION] Intel ICE
> Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> and cause continous kswapd* usage and continuous swapping" to
> Jesse Brandeburg <jesse.brandeburg@xxxxxxxxx>
> Tony Nguyen <anthony.l.nguyen@xxxxxxxxx>
> intel-wired-lan@xxxxxxxxxxxxxxxx
> Dave Ertman <david.m.ertman@xxxxxxxxx>
>
> Lets track the mglru here in this email thread. Yu, the kernel build
> with your mglru-fix-6.6.9.patch seem to be OK at least running it for
> 3days without kswapd usage (excluding the ice driver commit).

Hi Jaroslav,

Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
difference? IOW, were you able to reproduce the problem consistently
without it?

Thanks!
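
For comparing boots with and without a given patch, the per-node MemFree
rows quoted above can be summed and scanned for nearly exhausted nodes.
The following is only a minimal sketch and was not posted by anyone in
this thread. It assumes the table values are in MiB, and both the 512 MiB
cutoff and the helper name summarize() are hypothetical choices for
illustration, not kernel watermarks or existing tooling.

```python
"""Summarize per-NUMA-node MemFree figures like the tables quoted above."""

# MemFree rows copied from the two boot-time tables quoted in this thread.
# The unit is assumed to be MiB.
BEFORE_ICE_COMMIT = [2191, 2828, 92, 292, 3344, 2916, 3594, 3222]
AFTER_ICE_COMMIT = [46, 59, 51, 33, 3078, 3535, 2708, 3511]

# Hypothetical cutoff for "nearly exhausted"; this is not a kernel watermark.
LOW_FREE_MIB = 512


def summarize(mem_free_mib, low_mib=LOW_FREE_MIB):
    """Return (total free MiB, list of node ids with less than low_mib free)."""
    total = sum(mem_free_mib)
    starved = [node for node, free in enumerate(mem_free_mib) if free < low_mib]
    return total, starved


if __name__ == "__main__":
    for label, row in (("before", BEFORE_ICE_COMMIT), ("after", AFTER_ICE_COMMIT)):
        total, starved = summarize(row)
        print(f"{label}: {total} MiB free in total, starved nodes: {starved}")
```

With the quoted rows this flags nodes 2 and 3 in the first table and
nodes 0-3 in the second, which matches where kswapd activity was reported.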