On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > >
> > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu Zhao,
> > > > > > > > > > >
> > > > > > > > > > > thanks for the response, see answers inline:
> > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > > > > > > > > > > system (16 NUMA domains).
> > > > > > > > > > > >
> > > > > > > > > > > > Kernel version please?
> > > > > > > > > > >
> > > > > > > > > > > 6.5.y, but we saw it earlier as well; it has been under investigation
> > > > > > > > > > > since 23rd May (6.4.y and maybe even 6.3.y).
> > > > > > > > > >
> > > > > > > > > > v6.6 has a few critical fixes for MGLRU; I can backport them to v6.5
> > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > >
> > > > > > > > > I will give it a try using 6.6.y. If it works, we can switch to 6.6.y
> > > > > > > > > instead of backporting the fixes to 6.5.y.
> > > > > > > > >
> > > > > > > > > > > > > Symptoms of my issue are:
> > > > > > > > > > > > >
> > > > > > > > > > > > > /A/ if multi-gen LRU is enabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > >
> > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under
> > > > > > > > > > > > memory pressure.
> > > > > > > > > > > >
> > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
> > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > > > ...
> > > > > > > > > > > > >   765 root      20   0       0      0      0 R  98.3  0.0 34969:04 kswapd3
> > > > > > > > > > > > > ...
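For readers following along: the per-node numbers behind this report can be
watched with standard tools. A minimal sketch (node 3 is taken from the
report above; vmstat's si/so columns show KiB swapped in/out per second):

  # numactl -H | grep -E 'node 3 (size|free)'
  # grep -E 'MemTotal|MemFree' /sys/devices/system/node/node3/meminfo
  # vmstat 1    # watch the si/so columns

A sustained, roughly symmetrical si/so pair, as reported here, is the
signature of swap churn rather than a one-off reclaim spike.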
> > > > > > > > > > > > > 2/ swap space usage is low, about ~4 MB out of 8 GB, with swap on
> > > > > > > > > > > > > zram (it was observed with a swap disk as well, where it caused IO
> > > > > > > > > > > > > latency issues due to some kind of locking)
> > > > > > > > > > > > > 3/ swap in/out is huge and symmetrical, ~12 MB/s in and ~12 MB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > /B/ if multi-gen LRU is disabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
> > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > > > ...
> > > > > > > > > > > > >   765 root      20   0       0      0      0 S   3.6  0.0 34966:46 [kswapd3]
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > 2/ swap space usage is low (4 MB)
> > > > > > > > > > > > > 3/ swap in/out is huge and symmetrical, ~500 kB/s in and ~500 kB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > Both situations are wrong, as they use swap in/out extensively;
> > > > > > > > > > > > > however, the multi-gen LRU situation is 10 times worse.
> > > > > > > > > > > >
> > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > >
> > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. It is true
> > > > > > > > > > > that this node has the lowest free memory of all nodes (~302 MB free);
> > > > > > > > > > > however, swap space usage is just 4 MB (still going in and out). So
> > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > >
> > > > > > > > > > The best analogy is that refueling (reclaim) happens before the tank
> > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > ahead (high-order allocations).
> > > > > > > > > >
> > > > > > > > > > > The workers/applications run in pre-allocated HugePages, and the rest
> > > > > > > > > > > is used for a small set of system services and device drivers. It is
> > > > > > > > > > > static and not growing. The issue persists even when I stop the
> > > > > > > > > > > system services and free the memory.
> > > > > > > > > >
> > > > > > > > > > Yes, this helps. Also, could you attach /proc/buddyinfo from the
> > > > > > > > > > moment you hit the problem?
> > > > > > > > >
> > > > > > > > > I can. The problem is continuous: it is doing swap in/out 100% of the
> > > > > > > > > time, consuming 100% CPU and locking IO.
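For context on the /A/ vs. /B/ comparison above: MGLRU can be toggled at
runtime through the lru_gen sysfs interface (see
Documentation/admin-guide/mm/multigen_lru.rst). The 0x0007 value shown
below is only an example of a fully enabled state:

  # cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  # echo n >/sys/kernel/mm/lru_gen/enabled   # disable MGLRU (case /B/)
  # echo y >/sys/kernel/mm/lru_gen/enabled   # enable MGLRU (case /A/)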
> > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > >
> > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
> > > > > > > >
> > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > Normal zone, and this excludes the problem that commit
> > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > reclaim") fixed in v6.6.
> > > > > > >
> > > > > > > I built vanilla 6.6.1 and did a first quick test - spinning up and
> > > > > > > destroying VMs only. This test does not always trigger the continuous
> > > > > > > kswapd3 swap in/out usage, but it does exercise kswapd, and it looks
> > > > > > > like there is a change:
> > > > > > >
> > > > > > > I can see non-continuous kswapd usage (15 s of CPU time and more) with
> > > > > > > 6.5.y:
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > >   753 ?        S      0:00 [kswapd0]
> > > > > > >   754 ?        S      0:00 [kswapd1]
> > > > > > >   755 ?        S      0:00 [kswapd2]
> > > > > > >   756 ?        S      0:15 [kswapd3]  <<<<<<<<<
> > > > > > >   757 ?        S      0:00 [kswapd4]
> > > > > > >   758 ?        S      0:00 [kswapd5]
> > > > > > >   759 ?        S      0:00 [kswapd6]
> > > > > > >   760 ?        S      0:00 [kswapd7]
> > > > > > >   761 ?        S      0:00 [kswapd8]
> > > > > > >   762 ?        S      0:00 [kswapd9]
> > > > > > >   763 ?        S      0:00 [kswapd10]
> > > > > > >   764 ?        S      0:00 [kswapd11]
> > > > > > >   765 ?        S      0:00 [kswapd12]
> > > > > > >   766 ?        S      0:00 [kswapd13]
> > > > > > >   767 ?        S      0:00 [kswapd14]
> > > > > > >   768 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > and no kswapd usage with 6.6.1, which looks to be a promising path:
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > >   808 ?        S      0:00 [kswapd0]
> > > > > > >   809 ?        S      0:00 [kswapd1]
> > > > > > >   810 ?        S      0:00 [kswapd2]
> > > > > > >   811 ?        S      0:00 [kswapd3]  <<<< nice
> > > > > > >   812 ?        S      0:00 [kswapd4]
> > > > > > >   813 ?        S      0:00 [kswapd5]
> > > > > > >   814 ?        S      0:00 [kswapd6]
> > > > > > >   815 ?        S      0:00 [kswapd7]
> > > > > > >   816 ?        S      0:00 [kswapd8]
> > > > > > >   817 ?        S      0:00 [kswapd9]
> > > > > > >   818 ?        S      0:00 [kswapd10]
> > > > > > >   819 ?        S      0:00 [kswapd11]
> > > > > > >   820 ?        S      0:00 [kswapd12]
> > > > > > >   821 ?        S      0:00 [kswapd13]
> > > > > > >   822 ?        S      0:00 [kswapd14]
> > > > > > >   823 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > I will install 6.6.1 on the server which is doing some real work and
> > > > > > > observe it later today.
> > > > > >
> > > > > > Thanks. Fingers crossed.
> > > > >
> > > > > 6.6.y has been deployed and in use since 9th Nov 3 PM CEST. So far so
> > > > > good. Node 3 has 163 MiB of free memory, and I see just a little swap
> > > > > in/out usage sometimes (which is expected) and minimal kswapd3 process
> > > > > usage for almost 4 days.
> > > >
> > > > Thanks for the update!
> > > >
> > > > Just to confirm:
> > > > 1. MGLRU was enabled, and
> > >
> > > Yes, MGLRU is enabled.
> > >
> > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > >
> > > Vanilla 6.6, attached patch NOT applied.
> > >
> > > > Are both correct?
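A note on reading the /proc/buddyinfo output quoted earlier: each column is
the number of free blocks of order 0, 1, 2, ... (i.e., 2^order contiguous
pages), so a zone's free memory is the sum of count * 2^order * page size.
A quick sketch, assuming 4 KiB pages:

  # awk '/Normal/ { t = 0
         for (i = 5; i <= NF; i++) t += $i * 2^(i-5) * 4096   # bytes
         printf "%s %s %s: %.1f MiB free\n", $1, $2, $4, t / 1048576
       }' /proc/buddyinfo

For the Node 3 row above this works out to roughly 300 MiB, consistent with
the ~302 MB free mentioned earlier in the thread.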
> > > >
> > > > If so, I'd very much appreciate it if you could try the attached patch
> > > > on top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > compaction related, i.e., kswapd was woken up by high-order allocations
> > > > but didn't properly stop.
> > >
> > > Sure, I can try it. Will inform you about progress.
> >
> > Thanks!
> >
> > > > But what causes the behavior difference on v6.5 between MGLRU and the
> > > > active/inactive LRU still puzzles me -- the problem might be somehow
> > > > masked rather than fixed on v6.6.
> > >
> > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > change/try?
> >
> > Trying the attached patch is good enough for now :)
>
> So far I have been running "6.5.y + patch" for 4 days without triggering
> the infinite swap in/out usage.
>
> I'm observing a similar pattern in kswapd usage - if kswapd is used, it
> is mostly kswapd3, like vanilla 6.5.y and unlike 6.6.y. (Node 3's free
> memory is 159 MB.)
>
> # ps ax | grep [k]swapd
>   750 ?        S      0:00 [kswapd0]
>   751 ?        S      0:00 [kswapd1]
>   752 ?        S      0:00 [kswapd2]
>   753 ?        S      0:02 [kswapd3]  <<<< it uses kswapd3; the good
> part is that it is not continuous
>   754 ?        S      0:00 [kswapd4]
>   755 ?        S      0:00 [kswapd5]
>   756 ?        S      0:00 [kswapd6]
>   757 ?        S      0:00 [kswapd7]
>   758 ?        S      0:00 [kswapd8]
>   759 ?        S      0:00 [kswapd9]
>   760 ?        S      0:00 [kswapd10]
>   761 ?        S      0:00 [kswapd11]
>   762 ?        S      0:00 [kswapd12]
>   763 ?        S      0:00 [kswapd13]
>   764 ?        S      0:00 [kswapd14]
>   765 ?        S      0:00 [kswapd15]
>
> The good news is that the system did not end up in a continuous loop of
> swap in/out usage (at least so far), which is great. See the attached
> swap_in_out_good_vs_bad.png. I will keep it running for the next 3 days.

Thanks again, Jaroslav!

Just a note here: I suspect the problem still exists on v6.6 but is
somehow masked, possibly by reduced memory usage from the kernel itself
and more free memory for userspace. So to be on the safe side, I'll post
the patch and credit you as the reporter and tester.
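A closing pointer related to the "kswapd was woken up but didn't properly
stop" suspicion: whether a node's zone is still below its high watermark
can be read from /proc/zoneinfo. A sketch (node 3 again taken from this
report; the exact field layout may vary slightly between kernel versions):

  # awk '/^Node 3, zone +Normal/ { z = 1; print; next }
         z && ($1 == "pages" || $1 == "min" || $1 == "low" || $1 == "high") {
             print                  # "pages free" vs. min/low/high watermarks
             if ($1 == "high") exit # stop after the high watermark line
         }' /proc/zoneinfo

If "pages free" sits above "high" while kswapd keeps reclaiming, that
points at the stop condition rather than at genuine memory pressure.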