On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> >
> > Hi Jaroslav,
>
> Hi Yu Zhao
>
> thanks for response, see answers inline:
>
> >
> > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > system (16 NUMA domains).
> >
> > Kernel version please?
>
> 6.5.y, but we saw it sooner as it is in investigation from 23rd May
> (6.4.y and maybe even the 6.3.y).

v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
for you if you run into other problems with v6.6.

> > > Symptoms of my issue are
> > >
> > > /A/ if multi-gen LRU is enabled
> > > 1/ [kswapd3] is consuming 100% CPU
> >
> > Just thinking out loud: kswapd3 means the fourth node was under memory
> > pressure.
> >
> > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > 18.26, 15.01
> > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > 0.4 si, 0.0 st
> > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > ...
> > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > 34969:04 kswapd3
> > > ...
> > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > observed with swap disk as well and caused IO latency issues due to
> > > some kind of locking)
> > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > >
> > >
> > > /B/ if multi-gen LRU is disabled
> > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > 17.77, 14.77
> > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > 0.4 si, 0.0 st
> > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > ...
> > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > 34966:46 [kswapd3]
> > > ...
> > > 2/ swap space usage is low (4MB)
> > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > >
> > > Both situations are wrong as they are using swap in/out extensively,
> > > however the multi-gen LRU situation is 10 times worse.
> >
> > From the stats below, node 3 had the lowest free memory. So I think in
> > both cases, the reclaim activities were as expected.
>
> I do not see a reason for the memory pressure and reclaims. This node
> has the lowest free memory of all nodes (~302MB free) that is true,
> however the swap space usage is just 4MB (still going in and out). So
> what can be the reason for that behaviour?

The best analogy is that refuel (reclaim) happens before the tank
becomes empty, and it happens even sooner when there is a long road
ahead (high order allocations).

> The workers/application is running in pre-allocated HugePages and the
> rest is used for a small set of system services and drivers of
> devices. It is static and not growing. The issue persists when I stop
> the system services and free the memory.

Yes, this helps. Also could you attach /proc/buddyinfo from the moment
you hit the problem?
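When reading /proc/buddyinfo, it can help to see how much of a node's free
memory sits in high-order blocks. The following is a minimal sketch, assuming
the usual "Node N, zone NAME count count ..." layout (one count per order,
starting at order 0) and 4 KiB base pages. The helper names and the SAMPLE
text are hypothetical and are not data from this report.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB base pages (typical on x86-64)

# Hypothetical sample in the usual /proc/buddyinfo layout; NOT data from this thread.
SAMPLE = """\
Node 3, zone   Normal    100     50     10      0      0      0      0      0      0      0      0
Node 4, zone   Normal    200    120     60     30     12      4      2      1      0      0      0
"""


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order, index = order]}."""
    result = {}
    for line in text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)", line)
        if m:
            counts = [int(x) for x in m.group(3).split()]
            result[(int(m.group(1)), m.group(2))] = counts
    return result


def free_kib(counts, min_order=0):
    """Free memory in KiB held in blocks of at least min_order."""
    return sum(n * (PAGE_KIB << order)
               for order, n in enumerate(counts)
               if order >= min_order)


for (node, zone), counts in sorted(parse_buddyinfo(SAMPLE).items()):
    print(f"node {node} {zone}: {free_kib(counts)} KiB free in total, "
          f"{free_kib(counts, 4)} KiB in blocks of order >= 4")
```

On the affected host, pass the contents of /proc/buddyinfo to
parse_buddyinfo() instead of SAMPLE. A node whose free memory is almost
entirely in low-order blocks is fragmented, which is the "long road ahead"
case for high order allocations.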
> > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > pattern?
> >
> > The easiest way is to disable NUMA domain so that there would be only
> > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > has more memory and therefore they are less likely to become empty.
> >
> > > There is a free RAM in each numa node for the few MB used in
> > > swap:
> > > NUMA stats:
> > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > 65486 65486 65486 65486 65486 65486 65424
> > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > 2623 2833 2530 2269
> > > the in/out usage does not make sense for me nor the CPU utilization by
> > > multi-gen LRU.
> >
> > My questions:
> > 1. Were there any OOM kills with either case?
>
> There is no OOM. The memory usage is not growing nor the swap space
> usage, it is still a few MB there.
>
> > 2. Was THP enabled?
>
> Both situations with enabled and with disabled THP.

My suspicion is that you packed the node 3 too perfectly :) And that
might have triggered a known but currently low-priority problem in
MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
for me in case v6.6 by itself still has the problem?

> > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > produce more THPs.
> >
> > If disabling the NUMA domain isn't an option, I'd recommend:
>
> Disabling numa is not an option. However we are now testing a setup
> with -1GB in HugePages per each numa.
>
> > 1. Try the latest kernel (6.6.1) if you haven't.
>
> Not yet, the 6.6.1 was released today.
>
> > 2. Disable THP if it was enabled, to verify whether it has an impact.
>
> I tried disabling THP without any effect.

Gotcha. Please try the patch with MGLRU and let me know. Thanks!

(Also CC Charan @ Qualcomm who initially reported the problem that
ended up with the attached patch.)
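For reference, the per-node MemTotal/MemFree figures quoted in this thread can
be tabulated to show which node is tightest. This is only an illustrative
sketch: the numbers are copied from the NUMA stats in the report (taken to be
in MB, as the "~302MB free" remark suggests), and the variable names are
made up for the example.

```python
# Per-node figures copied from the NUMA stats quoted in the report (MB assumed).
MEM_TOTAL = [65048, 65486, 65486, 65486, 65486, 65486, 65486, 65469,
             65486, 65486, 65486, 65486, 65486, 65486, 65486, 65424]
MEM_FREE = [468, 601, 1200, 302, 548, 1879, 2321, 2478,
            1967, 2239, 1453, 2417, 2623, 2833, 2530, 2269]

# Free fraction per node; the node with the smallest value is under the
# most pressure.
free_fraction = [free / total for free, total in zip(MEM_FREE, MEM_TOTAL)]
tightest_node = min(range(len(MEM_FREE)), key=lambda n: free_fraction[n])

for node, frac in enumerate(free_fraction):
    marker = "  <-- lowest" if node == tightest_node else ""
    print(f"node {node:2d}: {MEM_FREE[node]:5d} MB free ({frac:.2%}){marker}")
```

The lowest entry is node 3, which matches the busy kswapd3 thread, since each
NUMA node has its own kswapd.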
Attachment: 0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch