Hello everyone,

On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> >
> > >
> > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > Hi yu,
> > > > > >
> > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > additional feedback? Thanks.
> > > > > >
> > > > > > First, thanks for taking this patch to upstream.
> > > > > >
> > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > @Jaroslav: Have you observed something like above?
> > > > >
> > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > fixing the kswapd continuous run issue.
> > > > >
> > > > > >
> > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > >
> > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > what all I can say for this patch.
> > > > > >
> > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > +
> > > > > > +               if (managed_zone(zone) &&
> > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > +                       return false;
> > > > > > +       }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Charan
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jaroslav Pulchart
> > > > > Sr. Principal SW Engineer
> > > > > GoodData
> > > >
> > > >
> > > > Hello,
> > > >
> > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > (from 6.6.8) and the server behaves much much worse.
> > > >
> > > > I got multiple kswapd* load to ~100% imediatelly.
> > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > are the changes in upstream different compared to the initial patch
> > > > which I tested?
> > > >
> > > > Best regards,
> > > > Jaroslav Pulchart
> > >
> > > Hi Jaroslav,
> > >
> > > My apologies for all the trouble!
> > >
> > > Yes, there is a slight difference between the fix you verified and
> > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > condition which I thought wouldn't affect you.
> > >
> > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > special condition.
> > >
> > > Thanks!
> >
> > Thanks for prompt response. I did a test with the patch and it didn't
> > help. The situation is super strange.
> >
> > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > worst situation, but the kswapd load is visible from 6.6.8.
> >
> > Setup of this server:
> > * 4 chiplets per each sockets, there are 2 sockets
> > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > memory pressure however it is even worse now in contrary.
> >
> > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > NUMA nodes:  0      1      2      3      4      5      6      7
> > HPTotalGiB:  28     28     28     28     28     28     28     28
> > HPFreeGiB:   28     28     28     28     28     28     28     28
> > MemTotal:    32264  32701  32701  32686  32701  32659  32701  32696
> > MemFree:     2766   2715   63     2366   3495   2990   3462   252
> >
> > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > NUMA nodes:  0      1      2      3      4      5      6      7
> > HPTotalGiB:  28     28     28     28     28     28     28     28
> > HPFreeGiB:   28     28     28     28     28     28     28     28
> > MemTotal:    32264  32701  32701  32686  32701  32701  32659  32696
> > MemFree:     2744   2788   65     581    3304   3215   3266   2226
> >
> > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > NUMA nodes:  0      1      2      3      4      5      6      7
> > HPTotalGiB:  28     28     28     28     28     28     28     28
> > HPFreeGiB:   28     28     28     28     28     28     28     28
> > MemTotal:    32264  32701  32701  32686  32659  32701  32701  32696
> > MemFree:     75     60     60     60     3169   2784   3203   2944
>
> I run few more combinations, and here are results / findings:
>
>   6.6.7-1 (vanila)                            == OK, no issue
>
>   6.6.8-1 (vanila)                            == single kswapd 100% !
>   6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
>   6.6.8-1 (revert four mglru patches)         == OK, no issue
>
>   6.6.9-1 (vanila)                            == four kswapd 100% !!!!
>   6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
>   6.6.9-3 (revert four mglru patches)         == four kswapd 100% !!!!
>
> Summary:
> * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> kernel 6.6.8,
> * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> be related to mglru patches at all

I was able to bisect this change and it looks like there is something
going wrong with the ice driver…

Usually after booting our server we see something like this.
Most of the nodes have ~2-3G of free memory. There are always 1-2 NUMA
nodes that have a really low amount of free memory and we don't know
why, but it looks like that is what ends up causing the constant swap
in/out issue. With the final bit of the patch you've sent earlier in
this thread it is almost invisible.

NUMA nodes:  0      1      2      3      4      5      6      7
HPTotalGiB:  28     28     28     28     28     28     28     28
HPFreeGiB:   28     28     28     28     28     28     28     28
MemTotal:    32264  32701  32659  32686  32701  32701  32701  32696
MemFree:     2191   2828   92     292    3344   2916   3594   3222

However, after the following patch we see that more NUMA nodes have
such a low amount of free memory, and that is causing constant
reclaiming of memory because it looks like something inside of the
kernel ate all the memory. This is right after the start of the system
as well.

NUMA nodes:  0      1      2      3      4      5      6      7
HPTotalGiB:  28     28     28     28     28     28     28     28
HPFreeGiB:   28     28     28     28     28     28     28     28
MemTotal:    32264  32701  32659  32686  32701  32701  32701  32696
MemFree:     46     59     51     33     3078   3535   2708   3511

The difference is 18G vs 12G of free memory summed across all NUMA
nodes right after boot of the system. If you have some hints on how to
debug what is actually occupying all that memory, maybe in both cases,
I would be happy to debug more!

Dave, would you have any idea why that patch could cause such a boost
in memory utilization?

commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
Author: Dave Ertman <david.m.ertman@xxxxxxxxx>
Date:   Mon Dec 11 13:19:28 2023 -0800

    ice: alter feature support check for SRIOV and LAG

    [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]

    Previously, the ice driver had support for using a handler for bonding
    netdev events to ensure that conflicting features were not allowed to be
    activated at the same time. While this was still in place, additional
    support was added to specifically support SRIOV and LAG together. These
    both utilized the netdev event handler, but the SRIOV and LAG feature was
    behind a capabilities feature check to make sure the current NVM has
    support.

    The exclusion part of the event handler should be removed since there are
    users who have custom made solutions that depend on the non-exclusion of
    features.

    Wrap the creation/registration and cleanup of the event handler and
    associated structs in the probe flow with a feature check so that the
    only systems that support the full implementation of LAG features will
    initialize support. This will leave other systems unhindered with
    functionality as it existed before any LAG code was added.
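
For anyone who wants to double-check the 18G vs 12G figures from the
two MemFree rows earlier in this mail, here is a minimal sketch. It
assumes the MemFree values are in MiB (which matches MemTotal being
about 32 GiB per node), and the 512 MiB "low memory" threshold is
only an illustrative choice, not anything the kernel uses. The helper
names are made up for this example.

```python
# MemFree rows copied from the two tables in this mail.
# Assumption: values are in MiB (consistent with MemTotal ~32 GiB per node).
MEMFREE_BEFORE = [2191, 2828, 92, 292, 3344, 2916, 3594, 3222]
MEMFREE_AFTER = [46, 59, 51, 33, 3078, 3535, 2708, 3511]


def total_gib(mib_values):
    """Sum per-node free memory (MiB) and return the total in GiB."""
    return sum(mib_values) / 1024


def low_nodes(mib_values, threshold_mib=512):
    """Return the NUMA node indexes whose free memory is below the threshold.

    The 512 MiB default is an arbitrary illustrative cut-off.
    """
    return [node for node, free in enumerate(mib_values) if free < threshold_mib]


if __name__ == "__main__":
    for label, row in (("before", MEMFREE_BEFORE), ("after", MEMFREE_AFTER)):
        print(f"{label}: {total_gib(row):.1f} GiB free, low nodes: {low_nodes(row)}")
```

With that threshold the low nodes come out as 2 and 3 before the
patch, and 0 to 3 (the whole first socket) after it.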