Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU

Yu Zhao <yuzhao@xxxxxxxxxx> · Wed, 8 Nov 2023 14:09:58 -0800

On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
<jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>
> >
> > Hi Jaroslav,
>
> Hi Yu Zhao
>
> thanks for response, see answers inline:
>
> >
> > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > system (16 NUMA domains).
> >
> > Kernel version please?
>
> 6.5.y, but we saw it earlier; we have been investigating it since 23rd May
> (6.4.y and maybe even 6.3.y).

v6.6 has a few critical fixes for MGLRU. I can backport them to v6.5
for you if you run into other problems with v6.6.

> > > Symptoms of my issue are
> > >
> > > /A/ if multi-gen LRU is enabled
> > > 1/ [kswapd3] is consuming 100% CPU
> >
> > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >
> > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > 18.26, 15.01
> > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > 0.4 si,  0.0 st
> > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > >     ...
> > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > 34969:04 kswapd3
> > >     ...
> > > 2/ swap space usage is low, about 4MB of 8GB, with swap on zram (also
> > > observed with a swap disk, where it caused IO latency issues due to
> > > some kind of locking)
> > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > >
> > >
> > > /B/ if multi-gen LRU is disabled
> > > 1/ [kswapd3] is consuming 3%-10% CPU
> > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > 17.77, 14.77
> > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > 0.4 si,  0.0 st
> > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > >     ...
> > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > 34966:46 [kswapd3]
> > >     ...
> > > 2/ swap space usage is low (4MB)
> > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > >
> > > Both situations are wrong as they are using swap in/out extensively,
> > > however the multi-gen LRU situation is 10 times worse.
> >
> > From the stats below, node 3 had the lowest free memory. So I think in
> > both cases, the reclaim activities were as expected.
>
> I do not see a reason for the memory pressure and reclaims. This node
> has the lowest free memory of all nodes (~302MB free) that is true,
> however the swap space usage is just 4MB (still going in and out). So
> what can be the reason for that behaviour?

The best analogy is that refuel (reclaim) happens before the tank
becomes empty, and it happens even sooner when there is a long road
ahead (high order allocations).
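
To make the analogy concrete, here is a minimal Python sketch of a
watermark check. It is a simplified model, not kernel code: the function
name, the block counts and the mark are made up for illustration, and it
ignores details such as lowmem reserves. It shows that a high-order
request can fail the check, and so trigger reclaim, even when the total
number of free pages is above the mark.

```python
def watermark_ok(free_blocks, order, mark):
    """Simplified model of a zone watermark check (illustrative only).

    free_blocks[o] is the number of free blocks of order o (2**o pages each).
    """
    free_pages = sum(count << o for o, count in enumerate(free_blocks))
    if free_pages <= mark:
        return False  # the tank is below the refuel line
    if order == 0:
        return True
    # A high-order request also needs at least one free block that is big enough.
    return any(count > 0 for count in free_blocks[order:])


# Hypothetical node: many free order-0/1 blocks, nothing at order 2 or above.
fragmented = [3000, 500, 0, 0, 0]
print(watermark_ok(fragmented, order=0, mark=2000))  # True
print(watermark_ok(fragmented, order=3, mark=2000))  # False
```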

> The workers/application is running in pre-allocated HugePages and the
> rest is used for a small set of system services and drivers of
> devices. It is static and not growing. The issue persists when I stop
> the system services and free the memory.

Yes, this helps. Also could you attach /proc/buddyinfo from the moment
you hit the problem?
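
In case it helps when reading the buddyinfo output, here is a small
Python sketch that summarizes it per node and zone. The helper names and
the sample line are made up for illustration; on a live system you would
pass in the contents of /proc/buddyinfo instead of the sample string.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from buddyinfo text."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node <n>, zone <name> <order-0 count> <order-1 count> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        result[(node, fields[3])] = [int(x) for x in fields[4:]]
    return result


def free_pages_at_or_above(counts, order):
    """Total free pages held in blocks of at least the given order."""
    return sum(c << o for o, c in enumerate(counts) if o >= order)


# Made-up sample line, not taken from the affected machine.
sample = "Node 3, zone   Normal   1200    300     40      2      0      0      0"
for (node, zone), counts in sorted(parse_buddyinfo(sample).items()):
    print(node, zone, counts, free_pages_at_or_above(counts, 2))
```

A node whose counts are all in the low orders has free memory on paper
but little to offer a high-order allocation.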

> > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > pattern?
> >
> > The easiest way is to disable NUMA domain so that there would be only
> > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > has more memory and therefore they are less likely to become empty.
> >
> > > There is free RAM in each NUMA node for the few MB used in
> > > swap:
> > >     NUMA stats:
> > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > 65486 65486 65486 65486 65486 65486 65424
> > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > 2623 2833 2530 2269
> > > the in/out usage makes no sense to me, nor does the CPU utilization by
> > > multi-gen LRU.
> >
> > My questions:
> > 1. Were there any OOM kills with either case?
>
> There is no OOM. The memory usage is not growing nor the swap space
> usage, it is still a few MB there.
>
> > 2. Was THP enabled?
>
> Both situations occur with THP enabled and with THP disabled.

My suspicion is that you packed node 3 too perfectly :) And that
might have triggered a known but currently low-priority problem in
MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
for me in case v6.6 by itself still has the problem.
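
For reference, the arithmetic behind "too perfectly", as a short Python
sketch. The numbers are copied from the NUMA stats in the report and are
assumed to be in MB; the variable names are only illustrative.

```python
# MemTotal and MemFree per node, copied from the NUMA stats in the report
# (assumed to be in MB).
mem_total = [65048, 65486, 65486, 65486, 65486, 65486, 65486, 65469,
             65486, 65486, 65486, 65486, 65486, 65486, 65486, 65424]
mem_free = [468, 601, 1200, 302, 548, 1879, 2321, 2478,
            1967, 2239, 1453, 2417, 2623, 2833, 2530, 2269]

# Percentage of each node's memory that is free.
free_pct = [100.0 * f / t for f, t in zip(mem_free, mem_total)]

# Node with the smallest free share.
tightest = min(range(len(free_pct)), key=free_pct.__getitem__)
print(f"node {tightest}: {mem_free[tightest]} MB free ({free_pct[tightest]:.2f}%)")
# prints: node 3: 302 MB free (0.46%)
```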

> > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > produce more THPs.
> >
> > If disabling the NUMA domain isn't an option, I'd recommend:
>
> Disabling NUMA is not an option. However, we are now testing a setup
> with 1GB less in HugePages per NUMA node.
>
> > 1. Try the latest kernel (6.6.1) if you haven't.
>
> Not yet, the 6.6.1 was released today.
>
> > 2. Disable THP if it was enabled, to verify whether it has an impact.
>
> I tried disabling THP, without any effect.

Gotcha. Please try the patch with MGLRU and let me know. Thanks!

(Also CC Charan @ Qualcomm who initially reported the problem that
ended up with the attached patch.)
Attachment:
0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch
