On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > >
> > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu Zhao,
> > > > > > > > > > >
> > > > > > > > > > > thanks for the response, see answers inline:
> > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > > > > > > > > > > system (16 NUMA domains).
> > > > > > > > > > > >
> > > > > > > > > > > > Kernel version please?
> > > > > > > > > > >
> > > > > > > > > > > 6.5.y, but we saw it earlier as well; it has been under investigation
> > > > > > > > > > > since 23rd May (6.4.y and maybe even 6.3.y).
> > > > > > > > > >
> > > > > > > > > > v6.6 has a few critical fixes for MGLRU; I can backport them to v6.5
> > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > >
> > > > > > > > > I will give it a try using 6.6.y. If it works, we can switch to 6.6.y
> > > > > > > > > instead of backporting the fixes to 6.5.y.
> > > > > > > > >
> > > > > > > > > > > > > Symptoms of my issue are:
> > > > > > > > > > > > >
> > > > > > > > > > > > > /A/ if multi-gen LRU is enabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > >
> > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under
> > > > > > > > > > > > memory pressure.
> > > > > > > > > > > >
> > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
> > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > > > ...
> > > > > > > > > > > > >   765 root      20   0       0      0      0 R  98.3  0.0 34969:04 kswapd3
> > > > > > > > > > > > > ...
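For readers following along: the per-node numbers behind this report can be
watched with standard tools. A minimal sketch (node 3 is taken from the
report above; vmstat's si/so columns show KiB swapped in/out per second):

  # numactl -H | grep -E 'node 3 (size|free)'
  # grep -E 'MemTotal|MemFree' /sys/devices/system/node/node3/meminfo
  # vmstat 1    # watch the si/so columns

A sustained, roughly symmetrical si/so pair, as reported here, is the
signature of swap churn rather than a one-off reclaim spike.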
> > > > > > > > > > > > > 2/ swap space usage is low, about ~4 MB out of 8 GB, with swap on
> > > > > > > > > > > > > zram (it was observed with a swap disk as well, where it caused IO
> > > > > > > > > > > > > latency issues due to some kind of locking)
> > > > > > > > > > > > > 3/ swap in/out is huge and symmetrical, ~12 MB/s in and ~12 MB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > /B/ if multi-gen LRU is disabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
> > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > > > ...
> > > > > > > > > > > > >   765 root      20   0       0      0      0 S   3.6  0.0 34966:46 [kswapd3]
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > 2/ swap space usage is low (4 MB)
> > > > > > > > > > > > > 3/ swap in/out is huge and symmetrical, ~500 kB/s in and ~500 kB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > Both situations are wrong, as they use swap in/out extensively;
> > > > > > > > > > > > > however, the multi-gen LRU situation is 10 times worse.
> > > > > > > > > > > >
> > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > >
> > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. It is true
> > > > > > > > > > > that this node has the lowest free memory of all nodes (~302 MB free);
> > > > > > > > > > > however, swap space usage is just 4 MB (still going in and out). So
> > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > >
> > > > > > > > > > The best analogy is that refueling (reclaim) happens before the tank
> > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > ahead (high-order allocations).
> > > > > > > > > >
> > > > > > > > > > > The workers/applications run in pre-allocated HugePages, and the rest
> > > > > > > > > > > is used for a small set of system services and device drivers. It is
> > > > > > > > > > > static and not growing. The issue persists even when I stop the
> > > > > > > > > > > system services and free the memory.
> > > > > > > > > >
> > > > > > > > > > Yes, this helps. Also, could you attach /proc/buddyinfo from the
> > > > > > > > > > moment you hit the problem?
> > > > > > > > >
> > > > > > > > > I can. The problem is continuous: it is doing swap in/out 100% of the
> > > > > > > > > time, consuming 100% CPU and locking IO.
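For context on the /A/ vs. /B/ comparison above: MGLRU can be toggled at
runtime through the lru_gen sysfs interface (see
Documentation/admin-guide/mm/multigen_lru.rst). The 0x0007 value shown
below is only an example of a fully enabled state:

  # cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  # echo n >/sys/kernel/mm/lru_gen/enabled   # disable MGLRU (case /B/)
  # echo y >/sys/kernel/mm/lru_gen/enabled   # enable MGLRU (case /A/)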
> > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > >
> > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
> > > > > > > >
> > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > Normal zone, and this excludes the problem that commit
> > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > reclaim") fixed in v6.6.
> > > > > > >
> > > > > > > I built vanilla 6.6.1 and did a first quick test - spinning up and
> > > > > > > destroying VMs only. This test does not always trigger the continuous
> > > > > > > kswapd3 swap in/out usage, but it does exercise kswapd, and it looks
> > > > > > > like there is a change:
> > > > > > >
> > > > > > > I can see non-continuous kswapd usage (15 s of CPU time and more) with
> > > > > > > 6.5.y:
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > >   753 ?        S      0:00 [kswapd0]
> > > > > > >   754 ?        S      0:00 [kswapd1]
> > > > > > >   755 ?        S      0:00 [kswapd2]
> > > > > > >   756 ?        S      0:15 [kswapd3]  <<<<<<<<<
> > > > > > >   757 ?        S      0:00 [kswapd4]
> > > > > > >   758 ?        S      0:00 [kswapd5]
> > > > > > >   759 ?        S      0:00 [kswapd6]
> > > > > > >   760 ?        S      0:00 [kswapd7]
> > > > > > >   761 ?        S      0:00 [kswapd8]
> > > > > > >   762 ?        S      0:00 [kswapd9]
> > > > > > >   763 ?        S      0:00 [kswapd10]
> > > > > > >   764 ?        S      0:00 [kswapd11]
> > > > > > >   765 ?        S      0:00 [kswapd12]
> > > > > > >   766 ?        S      0:00 [kswapd13]
> > > > > > >   767 ?        S      0:00 [kswapd14]
> > > > > > >   768 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > and no kswapd usage with 6.6.1, which looks to be a promising path:
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > >   808 ?        S      0:00 [kswapd0]
> > > > > > >   809 ?        S      0:00 [kswapd1]
> > > > > > >   810 ?        S      0:00 [kswapd2]
> > > > > > >   811 ?        S      0:00 [kswapd3]  <<<< nice
> > > > > > >   812 ?        S      0:00 [kswapd4]
> > > > > > >   813 ?        S      0:00 [kswapd5]
> > > > > > >   814 ?        S      0:00 [kswapd6]
> > > > > > >   815 ?        S      0:00 [kswapd7]
> > > > > > >   816 ?        S      0:00 [kswapd8]
> > > > > > >   817 ?        S      0:00 [kswapd9]
> > > > > > >   818 ?        S      0:00 [kswapd10]
> > > > > > >   819 ?        S      0:00 [kswapd11]
> > > > > > >   820 ?        S      0:00 [kswapd12]
> > > > > > >   821 ?        S      0:00 [kswapd13]
> > > > > > >   822 ?        S      0:00 [kswapd14]
> > > > > > >   823 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > I will install 6.6.1 on the server which is doing some real work and
> > > > > > > observe it later today.
> > > > > >
> > > > > > Thanks. Fingers crossed.
> > > > >
> > > > > 6.6.y has been deployed and in use since 9th Nov 3 PM CEST. So far so
> > > > > good. Node 3 has 163 MiB of free memory, and I see just a little swap
> > > > > in/out usage sometimes (which is expected) and minimal kswapd3 process
> > > > > usage for almost 4 days.
> > > >
> > > > Thanks for the update!
> > > >
> > > > Just to confirm:
> > > > 1. MGLRU was enabled, and
> > >
> > > Yes, MGLRU is enabled.
> > >
> > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > >
> > > Vanilla 6.6, attached patch NOT applied.
> > >
> > > > Are both correct?
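A note on reading the /proc/buddyinfo output quoted earlier: each column is
the number of free blocks of order 0, 1, 2, ... (i.e., 2^order contiguous
pages), so a zone's free memory is the sum of count * 2^order * page size.
A quick sketch, assuming 4 KiB pages:

  # awk '/Normal/ { t = 0
         for (i = 5; i <= NF; i++) t += $i * 2^(i-5) * 4096   # bytes
         printf "%s %s %s: %.1f MiB free\n", $1, $2, $4, t / 1048576
       }' /proc/buddyinfo

For the Node 3 row above this works out to roughly 300 MiB, consistent with
the ~302 MB free mentioned earlier in the thread.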
> > > >
> > > > If so, I'd very much appreciate it if you could try the attached patch
> > > > on top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > compaction related, i.e., kswapd was woken up by high-order allocations
> > > > but didn't properly stop.
> > >
> > > Sure, I can try it. Will inform you about progress.
> >
> > Thanks!
> >
> > > > But what causes the behavior difference on v6.5 between MGLRU and the
> > > > active/inactive LRU still puzzles me -- the problem might be somehow
> > > > masked rather than fixed on v6.6.
> > >
> > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > change/try?
> >
> > Trying the attached patch is good enough for now :)
>
> So far I have been running "6.5.y + patch" for 4 days without triggering
> the infinite swap in/out usage.
>
> I'm observing a similar pattern in kswapd usage - if kswapd is used, it
> is mostly kswapd3, like vanilla 6.5.y and unlike 6.6.y. (Node 3's free
> memory is 159 MB.)
>
> # ps ax | grep [k]swapd
>   750 ?        S      0:00 [kswapd0]
>   751 ?        S      0:00 [kswapd1]
>   752 ?        S      0:00 [kswapd2]
>   753 ?        S      0:02 [kswapd3]  <<<< it uses kswapd3; the good
> part is that it is not continuous
>   754 ?        S      0:00 [kswapd4]
>   755 ?        S      0:00 [kswapd5]
>   756 ?        S      0:00 [kswapd6]
>   757 ?        S      0:00 [kswapd7]
>   758 ?        S      0:00 [kswapd8]
>   759 ?        S      0:00 [kswapd9]
>   760 ?        S      0:00 [kswapd10]
>   761 ?        S      0:00 [kswapd11]
>   762 ?        S      0:00 [kswapd12]
>   763 ?        S      0:00 [kswapd13]
>   764 ?        S      0:00 [kswapd14]
>   765 ?        S      0:00 [kswapd15]
>
> The good news is that the system did not end up in a continuous loop of
> swap in/out usage (at least so far), which is great. See the attached
> swap_in_out_good_vs_bad.png. I will keep it running for the next 3 days.

Thanks again, Jaroslav!

Just a note here: I suspect the problem still exists on v6.6 but is
somehow masked, possibly by reduced memory usage from the kernel itself
and more free memory for userspace. So to be on the safe side, I'll post
the patch and credit you as the reporter and tester.
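A closing pointer related to the "kswapd was woken up but didn't properly
stop" suspicion: whether a node's zone is still below its high watermark
can be read from /proc/zoneinfo. A sketch (node 3 again taken from this
report; the exact field layout may vary slightly between kernel versions):

  # awk '/^Node 3, zone +Normal/ { z = 1; print; next }
         z && ($1 == "pages" || $1 == "min" || $1 == "low" || $1 == "high") {
             print                  # "pages free" vs. min/low/high watermarks
             if ($1 == "high") exit # stop after the high watermark line
         }' /proc/zoneinfo

If "pages free" sits above "high" while kswapd keeps reclaiming, that
points at the stop condition rather than at genuine memory pressure.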