Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU

Yu Zhao <yuzhao@xxxxxxxxxx> · Wed, 22 Nov 2023 07:18:49 -0700

On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>

> >

> > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart

> > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > >

> > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart

> > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > > > >

> > > > > >

> > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart

> > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > > > > > >

> > > > > > > >

> > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart

> > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > > > > > > > >

> > > > > > > > > >

> > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart

> > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > > > > > > > > > >

> > > > > > > > > > > >

> > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart

> > > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > > > > > > > > > > > >

> > > > > > > > > > > > > >

> > > > > > > > > > > > > > Hi Jaroslav,

> > > > > > > > > > > > >

> > > > > > > > > > > > > Hi Yu Zhao

> > > > > > > > > > > > >

> > > > > > > > > > > > > thanks for the response, see answers inline:

> > > > > > > > > > > > >

> > > > > > > > > > > > > >

> > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart

> > > > > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:

> > > > > > > > > > > > > > >

> > > > > > > > > > > > > > > Hello,

> > > > > > > > > > > > > > >

> > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU

> > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3

> > > > > > > > > > > > > > > system (16 NUMA domains).

> > > > > > > > > > > > > >

> > > > > > > > > > > > > > Kernel version please?

> > > > > > > > > > > > >

> > > > > > > > > > > > > 6.5.y, but we saw it earlier; it has been under investigation since 23rd May

> > > > > > > > > > > > > (6.4.y and maybe even 6.3.y).

> > > > > > > > > > > >

> > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5

> > > > > > > > > > > > for you if you run into other problems with v6.6.

> > > > > > > > > > > >

> > > > > > > > > > >

> > > > > > > > > > > I will give it a try using 6.6.y. If it works, we can switch to

> > > > > > > > > > > 6.6.y instead of backporting the fixes to 6.5.y.

> > > > > > > > > > >

> > > > > > > > > > > > > > > Symptoms of my issue are

> > > > > > > > > > > > > > >

> > > > > > > > > > > > > > > /A/ if multi-gen LRU is enabled

> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU

> > > > > > > > > > > > > >

> > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.

> > > > > > > > > > > > > >

> > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01

> > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie

> > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st

> > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache

> > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem

> > > > > > > > > > > > > > >     ...

> > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3

> > > > > > > > > > > > > > >     ...

> > > > > > > > > > > > > > > 2/ swap space usage is low, only ~4MB of the 8GB swap, which is in zram (this was

> > > > > > > > > > > > > > > also observed with a swap disk, where it caused IO latency issues due to

> > > > > > > > > > > > > > > some kind of locking)

> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out

> > > > > > > > > > > > > > >

> > > > > > > > > > > > > > >

> > > > > > > > > > > > > > > /B/ if multi-gen LRU is disabled

> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU

> > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77

> > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie

> > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st

> > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache

> > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem

> > > > > > > > > > > > > > >     ...

> > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]

> > > > > > > > > > > > > > >     ...

> > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)

> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out

> > > > > > > > > > > > > > >

> > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,

> > > > > > > > > > > > > > > however the multi-gen LRU situation is 10 times worse.
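
The swap in/out rates quoted in the report can be reproduced from the kernel's `pswpin` and `pswpout` counters in `/proc/vmstat`, which count pages swapped in and out since boot. The following is a minimal sketch, not the reporter's actual tooling: the function names are hypothetical, and the 4096-byte page size is an assumption (the x86-64 default).

```python
import time


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) page counters from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters["pswpin"], counters["pswpout"]


def swap_rates_mb_per_s(before, after, interval_s, page_size=4096):
    """Convert two (pswpin, pswpout) samples into (in, out) rates in MB/s.

    page_size is an assumption; 4096 bytes is the usual x86-64 value.
    """
    return tuple(
        (a - b) * page_size / interval_s / 1e6 for b, a in zip(before, after)
    )


def sample_live(interval_s=1.0, path="/proc/vmstat"):
    """Take two samples interval_s apart on a live Linux system."""
    with open(path) as f:
        before = parse_swap_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_swap_counters(f.read())
    return swap_rates_mb_per_s(before, after, interval_s)
```

A symmetrical pattern like the one described shows up as two nearly equal values from `sample_live()` while swap space usage itself stays flat.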

> > > > > > > > > > > > > >

> > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in

> > > > > > > > > > > > > > both cases, the reclaim activities were as expected.

> > > > > > > > > > > > >

> > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node

> > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,

> > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So

> > > > > > > > > > > > > what can be the reason for that behaviour?

> > > > > > > > > > > >

> > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank

> > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road

> > > > > > > > > > > > ahead (high order allocations).

> > > > > > > > > > > >

> > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the

> > > > > > > > > > > > > rest is used for a small set of system services and drivers of

> > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop

> > > > > > > > > > > > > the system services and free the memory.

> > > > > > > > > > > >

> > > > > > > > > > > > Yes, this helps.

> > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment

> > > > > > > > > > > > you hit the problem?

> > > > > > > > > > > >

> > > > > > > > > > >

> > > > > > > > > > > I can. The problem is continuous: it is doing swap in/out 100% of the

> > > > > > > > > > > time, consuming 100% of CPU and locking IO.

> > > > > > > > > > >

> > > > > > > > > > > The output of /proc/buddyinfo is:

> > > > > > > > > > >

> > > > > > > > > > > # cat /proc/buddyinfo

> > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1

> > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4

> > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5

> > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55

> > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74

> > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
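
Each column of `/proc/buddyinfo` is the number of free blocks of order 0, 1, 2, ... in that zone, so a zone's free memory is the sum of `count * 2^order` pages. The sketch below totals that per node and zone. It is illustrative only: the function name is hypothetical, the 4096-byte page size is an assumption, and each zone must be on a single line (the mail client wrapped the lines in the original message).

```python
def parse_buddyinfo(text, page_size=4096):
    """Map (node, zone) -> free bytes from /proc/buddyinfo-style text.

    page_size is an assumption; 4096 bytes is the usual x86-64 value.
    """
    free_bytes = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <order0> <order1> ...
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        # A free block of order k spans 2**k contiguous pages.
        free_pages = sum(n << order for order, n in enumerate(counts))
        free_bytes[(node, zone)] = free_pages * page_size
    return free_bytes
```

Applied to the Node 3 line above, this gives 76928 free pages, or about 300 MiB with 4 KiB pages, which is consistent with the roughly 302MB of free memory reported for that node.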

> > > > > > > > > >

> > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the

> > > > > > > > > > normal zone, and this excludes the problem commit

> > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone

> > > > > > > > > > reclaim") fixed in v6.6.

> > > > > > > > >

> > > > > > > > > I built vanilla 6.6.1 and did a first quick test - spin up and destroy

> > > > > > > > > VMs only. This test does not always trigger the continuous kswapd3

> > > > > > > > > swap in/out usage, but it does use kswapd, and it looks like there is a

> > > > > > > > > change:

> > > > > > > > >

> > > > > > > > >  I can see non-continuous kswapd usage (15s and more) with 6.5.y

> > > > > > > > >  # ps ax | grep [k]swapd

> > > > > > > > >     753 ?        S      0:00 [kswapd0]

> > > > > > > > >     754 ?        S      0:00 [kswapd1]

> > > > > > > > >     755 ?        S      0:00 [kswapd2]

> > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<

> > > > > > > > >     757 ?        S      0:00 [kswapd4]

> > > > > > > > >     758 ?        S      0:00 [kswapd5]

> > > > > > > > >     759 ?        S      0:00 [kswapd6]

> > > > > > > > >     760 ?        S      0:00 [kswapd7]

> > > > > > > > >     761 ?        S      0:00 [kswapd8]

> > > > > > > > >     762 ?        S      0:00 [kswapd9]

> > > > > > > > >     763 ?        S      0:00 [kswapd10]

> > > > > > > > >     764 ?        S      0:00 [kswapd11]

> > > > > > > > >     765 ?        S      0:00 [kswapd12]

> > > > > > > > >     766 ?        S      0:00 [kswapd13]

> > > > > > > > >     767 ?        S      0:00 [kswapd14]

> > > > > > > > >     768 ?        S      0:00 [kswapd15]

> > > > > > > > >

> > > > > > > > > and no kswapd usage with 6.6.1, which looks like a promising path

> > > > > > > > >

> > > > > > > > > # ps ax | grep [k]swapd

> > > > > > > > >     808 ?        S      0:00 [kswapd0]

> > > > > > > > >     809 ?        S      0:00 [kswapd1]

> > > > > > > > >     810 ?        S      0:00 [kswapd2]

> > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice

> > > > > > > > >     812 ?        S      0:00 [kswapd4]

> > > > > > > > >     813 ?        S      0:00 [kswapd5]

> > > > > > > > >     814 ?        S      0:00 [kswapd6]

> > > > > > > > >     815 ?        S      0:00 [kswapd7]

> > > > > > > > >     816 ?        S      0:00 [kswapd8]

> > > > > > > > >     817 ?        S      0:00 [kswapd9]

> > > > > > > > >     818 ?        S      0:00 [kswapd10]

> > > > > > > > >     819 ?        S      0:00 [kswapd11]

> > > > > > > > >     820 ?        S      0:00 [kswapd12]

> > > > > > > > >     821 ?        S      0:00 [kswapd13]

> > > > > > > > >     822 ?        S      0:00 [kswapd14]

> > > > > > > > >     823 ?        S      0:00 [kswapd15]

> > > > > > > > >

> > > > > > > > > I will install the 6.6.1 on the server which is doing some work and

> > > > > > > > > observe it later today.

> > > > > > > >

> > > > > > > > Thanks. Fingers crossed.

> > > > > > >

> > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.

> > > > > > > Node 3 has 163MiB of free memory, and I see

> > > > > > > only occasional swap in/out activity (which is expected) and minimal

> > > > > > > kswapd3 process usage for almost 4 days.

> > > > > >

> > > > > > Thanks for the update!

> > > > > >

> > > > > > Just to confirm:

> > > > > > 1. MGLRU was enabled, and

> > > > >

> > > > > Yes, MGLRU is enabled

> > > > >

> > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.

> > > > >

> > > > > Vanilla 6.6, attached patch NOT applied.

> > > > >

> > > > > > Are both correct?

> > > > > >

> > > > > > If so, I'd very much appreciate it if you could try the attached patch on

> > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is

> > > > > > compaction related, i.e., kswapd was woken up by high order

> > > > > > allocations but didn't properly stop. But what causes the behavior

> > > > >

> > > > > Sure, I can try it. Will inform you about progress.

> > > >

> > > > Thanks!

> > > >

> > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still

> > > > > > puzzles me -- the problem might be somehow masked rather than fixed on

> > > > > > v6.6.

> > > > >

> > > > > I'm not sure how I can help with the issue. Any suggestions on what to

> > > > > change/try?

> > > >

> > > > Trying the attached patch is good enough for now :)

> > >

> > > So far I'm running the "6.5.y + patch" for 4 days without triggering

> > > the infinite swap in/out usage.

> > >

> > > I'm observing a similar pattern in kswapd usage: if kswapd is used,

> > > it is mostly kswapd3, as with vanilla 6.5.y. This is

> > > not observed with 6.6.y. (Node 3's free memory is 159 MB.)

> > > # ps ax | grep [k]swapd

> > >     750 ?        S      0:00 [kswapd0]

> > >     751 ?        S      0:00 [kswapd1]

> > >     752 ?        S      0:00 [kswapd2]

> > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3; the good part is that it is not continuous

> > >     754 ?        S      0:00 [kswapd4]

> > >     755 ?        S      0:00 [kswapd5]

> > >     756 ?        S      0:00 [kswapd6]

> > >     757 ?        S      0:00 [kswapd7]

> > >     758 ?        S      0:00 [kswapd8]

> > >     759 ?        S      0:00 [kswapd9]

> > >     760 ?        S      0:00 [kswapd10]

> > >     761 ?        S      0:00 [kswapd11]

> > >     762 ?        S      0:00 [kswapd12]

> > >     763 ?        S      0:00 [kswapd13]

> > >     764 ?        S      0:00 [kswapd14]

> > >     765 ?        S      0:00 [kswapd15]

> > >

> > > The good news is that the system did not end up in a continuous loop of swap

> > > in/out usage (at least so far) which is great. See attached

> > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3

> > > days.

> >

> > Thanks again, Jaroslav!

> >

> > Just a note here: I suspect the problem still exists on v6.6 but

> > somehow is masked, possibly by reduced memory usage from the kernel

> > itself and more free memory for userspace. So to be on the safe side,

> > I'll post the patch and credit you as the reporter and tester.

>

> Morning, let's wait. I reviewed the graph, and swap in/out started

> happening again at 1:50 AM CET. It is slower than before (CPU utilization

> of 0.3%), but it is doing swap in/out; see the attached png.

> I investigated it more: there was an operational issue, and the system

> disabled multi-gen LRU yesterday at ~10 AM CET (our temporary workaround

> for this problem) by

>    echo N > /sys/kernel/mm/lru_gen/enabled

> when an alert was triggered by an unexpected setup of the server.

> Could it be that the patch is not functional if lru_gen/enabled is

> 0x0000?

That’s correct.

> I need to reboot the system and do the whole week's test again.

Thanks a lot!
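
Since the test run was invalidated by MGLRU being switched off at runtime, it may help to check the state of `/sys/kernel/mm/lru_gen/enabled` before trusting a measurement. Reading that file returns a hexadecimal bitmask in which bit `0x0001` is the main switch; writing `y` or `n` (in either case) sets or clears all components. The sketch below decodes the mask. The function and constant names are hypothetical, and the component descriptions are paraphrased from the kernel's multi-gen LRU admin guide.

```python
# Bit meanings paraphrased from the kernel's multi-gen LRU admin guide.
LRU_GEN_COMPONENTS = {
    0x0001: "main switch",
    0x0002: "batched accessed-bit clearing in leaf page table entries",
    0x0004: "accessed-bit clearing in non-leaf page table entries",
}


def decode_lru_gen_enabled(text):
    """Map each component description to True/False for a value like '0x0007'."""
    mask = int(text.strip(), 16)
    return {name: bool(mask & bit) for bit, name in LRU_GEN_COMPONENTS.items()}


def mglru_active(text):
    """True if the main MGLRU switch (bit 0x0001) is set."""
    return bool(int(text.strip(), 16) & 0x0001)
```

On a live system the input would come from `open("/sys/kernel/mm/lru_gen/enabled").read()`; a value of `0x0000` means MGLRU is fully disabled, which is the state the server was in when the swap in/out resumed.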