> On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>>
>> >
>> > >
>> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
>> > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > >
>> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
>> > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > > > >
>> > > > > > >
>> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
>> > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
>> > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
>> > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
>> > > > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Jaroslav,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Yu Zhao,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > thanks for the response, see answers inline:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
>> > > > > > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
>> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
>> > > > > > > > > > > > > > > > system (16 NUMA domains).
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Kernel version please?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 6.5.y, but we saw it earlier; it has been under investigation since 23rd May
>> > > > > > > > > > > > > > (6.4.y and maybe even 6.3.y).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
>> > > > > > > > > > > > > for you if you run into other problems with v6.6.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I will give it a try using 6.6.y. If it works, we can switch to
>> > > > > > > > > > > > 6.6.y instead of backporting the fixes to 6.5.y.
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Symptoms of my issue are
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /A/ if multi-gen LRU is enabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
>> > > > > > > > > > > > > > > > 18.26, 15.01
>> > > > > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
>> > > > > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si, 0.0 st
>> > > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
>> > > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
>> > > > > > > > > > > > > > > > 34969:04 kswapd3
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low, about ~4MB of the 8GB, as swap on zram (it was
>> > > > > > > > > > > > > > > > observed with a swap disk as well and caused IO latency issues due to
>> > > > > > > > > > > > > > > > some kind of locking)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical, ~12MB/s in and ~12MB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /B/ if multi-gen LRU is disabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
>> > > > > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
>> > > > > > > > > > > > > > > > 17.77, 14.77
>> > > > > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
>> > > > > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si, 0.0 st
>> > > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
>> > > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
>> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical, ~500kB/s in and ~500kB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
>> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10 times worse.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
>> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
>> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free), that is true;
>> > > > > > > > > > > > > > however, the swap space usage is just 4MB (still going in and out). So
>> > > > > > > > > > > > > > what can be the reason for that behaviour?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
>> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
>> > > > > > > > > > > > > ahead (high order allocations).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
>> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
>> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
>> > > > > > > > > > > > > > the system services and free the memory.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Yes, this helps.
>> > > > > > > > > > > > > Also could you attach /proc/buddyinfo from the moment
>> > > > > > > > > > > > > you hit the problem?
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I can. The problem is continuous; it is 100% of the time continuously
>> > > > > > > > > > > > doing in/out, consuming 100% of a CPU and locking IO.
>> > > > > > > > > > > >
>> > > > > > > > > > > > The output of /proc/buddyinfo is:
>> > > > > > > > > > > >
>> > > > > > > > > > > > # cat /proc/buddyinfo
>> > > > > > > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
>> > > > > > > > > > > > 1 1 2 1
>> > > > > > > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
>> > > > > > > > > > > > 61 43 23 4
>> > > > > > > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
>> > > > > > > > > > > > 41 9 1 5
>> > > > > > > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
>> > > > > > > > > > > > 56 42 36 55
>> > > > > > > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
>> > > > > > > > > > > > 194 238 47 74
>> > > > > > > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
>> > > > > > > > > > > > 44 14 42 10
>> > > > > > > > > > >
>> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
>> > > > > > > > > > > normal zone, and this excludes the problem commit
>> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
>> > > > > > > > > > > reclaim") fixed in v6.6.
>> > > > > > > > > >
>> > > > > > > > > > I built vanilla 6.6.1 and did a first quick test - spin up and destroy
>> > > > > > > > > > VMs only. This test does not always trigger the continuous kswapd3
>> > > > > > > > > > swap in/out usage, but it does use kswapd and it looks like there is a
>> > > > > > > > > > change:
>> > > > > > > > > >
>> > > > > > > > > > I can see non-continuous kswapd usage (15s and more) with 6.5.y
>> > > > > > > > > > # ps ax | grep [k]swapd
>> > > > > > > > > > 753 ? S 0:00 [kswapd0]
>> > > > > > > > > > 754 ? S 0:00 [kswapd1]
>> > > > > > > > > > 755 ? S 0:00 [kswapd2]
>> > > > > > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
>> > > > > > > > > > 757 ? S 0:00 [kswapd4]
>> > > > > > > > > > 758 ? S 0:00 [kswapd5]
>> > > > > > > > > > 759 ? S 0:00 [kswapd6]
>> > > > > > > > > > 760 ? S 0:00 [kswapd7]
>> > > > > > > > > > 761 ? S 0:00 [kswapd8]
>> > > > > > > > > > 762 ? S 0:00 [kswapd9]
>> > > > > > > > > > 763 ? S 0:00 [kswapd10]
>> > > > > > > > > > 764 ? S 0:00 [kswapd11]
>> > > > > > > > > > 765 ? S 0:00 [kswapd12]
>> > > > > > > > > > 766 ? S 0:00 [kswapd13]
>> > > > > > > > > > 767 ? S 0:00 [kswapd14]
>> > > > > > > > > > 768 ? S 0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > and no kswapd usage with 6.6.1, which looks to be a promising path
>> > > > > > > > > >
>> > > > > > > > > > # ps ax | grep [k]swapd
>> > > > > > > > > > 808 ? S 0:00 [kswapd0]
>> > > > > > > > > > 809 ? S 0:00 [kswapd1]
>> > > > > > > > > > 810 ? S 0:00 [kswapd2]
>> > > > > > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
>> > > > > > > > > > 812 ? S 0:00 [kswapd4]
>> > > > > > > > > > 813 ? S 0:00 [kswapd5]
>> > > > > > > > > > 814 ? S 0:00 [kswapd6]
>> > > > > > > > > > 815 ? S 0:00 [kswapd7]
>> > > > > > > > > > 816 ? S 0:00 [kswapd8]
>> > > > > > > > > > 817 ? S 0:00 [kswapd9]
>> > > > > > > > > > 818 ? S 0:00 [kswapd10]
>> > > > > > > > > > 819 ? S 0:00 [kswapd11]
>> > > > > > > > > > 820 ? S 0:00 [kswapd12]
>> > > > > > > > > > 821 ? S 0:00 [kswapd13]
>> > > > > > > > > > 822 ? S 0:00 [kswapd14]
>> > > > > > > > > > 823 ? S 0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > I will install 6.6.1 on the server which is doing some work and
>> > > > > > > > > > observe it later today.
>> > > > > > > > >
>> > > > > > > > > Thanks. Fingers crossed.
>> > > > > > > >
>> > > > > > > > The 6.6.y kernel was deployed and used from 9th Nov 3PM CEST. So far so good.
>> > > > > > > > Node 3 has 163MiB of free memory and I see
>> > > > > > > > just a little swap in/out usage sometimes (which is expected) and minimal
>> > > > > > > > kswapd3 process usage for almost 4 days.
>> > > > > > >
>> > > > > > > Thanks for the update!
>> > > > > > >
>> > > > > > > Just to confirm:
>> > > > > > > 1. MGLRU was enabled, and
>> > > > > >
>> > > > > > Yes, MGLRU is enabled.
>> > > > > >
>> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>> > > > > >
>> > > > > > Vanilla 6.6, attached patch NOT applied.
>> > > > > >
>> > > > > > > Are both correct?
>> > > > > > >
>> > > > > > > If so, I'd very much appreciate it if you could try the attached patch on
>> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
>> > > > > > > compaction related, i.e., kswapd was woken up by high order
>> > > > > > > allocations but didn't properly stop. But what causes the behavior
>> > > > > >
>> > > > > > Sure, I can try it. I will inform you about the progress.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
>> > > > > > > puzzles me -- the problem might be somehow masked rather than fixed on
>> > > > > > > v6.6.
>> > > > > >
>> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
>> > > > > > change/try?
>> > > > >
>> > > > > Trying the attached patch is good enough for now :)
>> > > >
>> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
>> > > > the infinite swap in/out usage.
>> > > >
>> > > > I'm observing a similar pattern in kswapd usage - if it uses kswapd,
>> > > > then it is mostly kswapd3 - like vanilla 6.5.y, which is
>> > > > not observed with 6.6.y. (Node 3's free mem is 159 MB.)
>> > > > # ps ax | grep [k]swapd
>> > > > 750 ? S 0:00 [kswapd0]
>> > > > 751 ? S 0:00 [kswapd1]
>> > > > 752 ? S 0:00 [kswapd2]
>> > > > 753 ? S 0:02 [kswapd3] <<<< it uses kswapd3; the good
>> > > > thing is that it is not continuous
>> > > > 754 ? S 0:00 [kswapd4]
>> > > > 755 ? S 0:00 [kswapd5]
>> > > > 756 ? S 0:00 [kswapd6]
>> > > > 757 ? S 0:00 [kswapd7]
>> > > > 758 ? S 0:00 [kswapd8]
>> > > > 759 ? S 0:00 [kswapd9]
>> > > > 760 ? S 0:00 [kswapd10]
>> > > > 761 ? S 0:00 [kswapd11]
>> > > > 762 ? S 0:00 [kswapd12]
>> > > > 763 ? S 0:00 [kswapd13]
>> > > > 764 ? S 0:00 [kswapd14]
>> > > > 765 ? S 0:00 [kswapd15]
>> > > >
>> > > > The good thing is that the system did not end up in a continuous loop of swap
>> > > > in/out usage (at least so far), which is great. See the attached
>> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
>> > > > days.
>> > >
>> > > Thanks again, Jaroslav!
>> > >
>> > > Just a note here: I suspect the problem still exists on v6.6 but
>> > > is somehow masked, possibly by reduced memory usage from the kernel
>> > > itself and more free memory for userspace. So to be on the safe side,
>> > > I'll post the patch and credit you as the reporter and tester.
>> >
>> > Morning, let's wait. I reviewed the graph and the swap in/out started
>> > happening again from 1:50 AM CET. It is slower than before (CPU util
>> > 0.3%) but it is doing in/out, see the attached png.
>>
>> I investigated it more; there was an operational issue and the system
>> disabled multi-gen LRU yesterday at ~10 AM CET (our temporary workaround
>> for this problem) by
>> echo N > /sys/kernel/mm/lru_gen/enabled
>> when an alert was triggered by an unexpected setup of the server.
>> Could it be that the patch is not functional if lru_gen/enabled is
>> 0x0000?
>
>
> That’s correct.
>
>> I need to reboot the system and do the whole week's test again.
>
>
> Thanks a lot!

The server with 6.5.y + the LRU patch is stable; no continuous swap
in/out has been observed in the last 7 days! I assume the fix is
correct.

Can you share the final patch for 6.6.y with me? I will use it in our
kernel builds until it lands upstream.
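
For completeness, this is the rough set of checks I keep running after
the reboot to confirm the behaviour stays gone. It is nothing clever,
just the standard /proc and /sys interfaces already used in this
thread; node 3 is simply the node that was under pressure on this box:

# cat /sys/kernel/mm/lru_gen/enabled      # 0x0000 means MGLRU is disabled, non-zero means enabled
# grep -E 'pswpin|pswpout' /proc/vmstat   # cumulative pages swapped in/out, I watch the deltas
# ps -o pid,comm,time -C kswapd3          # accumulated CPU time of the suspect kswapd thread
# grep 'Node 3,' /proc/buddyinfo          # free-page orders on node 3, as requested earlier

If the pswpin/pswpout deltas stay near zero and the kswapd3 CPU time
stops growing, I consider the machine healthy.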