On 4/20/22 14:44, Johannes Weiner wrote:
>>
>> The larger issue is that our workload has regressed in performance.
>>
>> With V2 and swappiness=10 we are still seeing some swap, but very little tearing
>> down of THPs over time. With swappiness=0 it did some when swap but we are not
>> losings GBs of THPS (with your patch swappiness=0 has swap or THP issues on V2).

I meant to say `with your patch swappiness=0 does not swap or have thp
issues on v2`

>>
>> With V1 and swappiness=(0|10)(with and without your patch), it swaps a ton and
>> ultimately leads to a significant amount of THP splitting. So the longer the
>> system/workload runs, the less likely we are to get THPs backing the guest and
>> the performance gain from THPs is lost.
>
> I hate to ask, but is it possible this is a configuration issue

I'm very glad you asked :)

>
> One significant difference between V1 and V2 is that V1 has per-cgroup
> swappiness, which is inherited when the cgroup is created. So if you
> set sysctl vm.swappiness=0 after cgroups have been created, it will
> not update them. V2 cgroups do use vm.swappiness:

This is something I did not consider... Thank you for pointing that out!

The issue still occurs whether or not I set the swappiness value before
the VM boots.

However, this led me to find the icing on the cake :)
Even though I set vm.swappiness=0 at boot using sysctl.conf, I was not
considering the fact that libvirtd creates its own cgroup for the
machines you start with it... Additionally, that cgroup does not inherit
the sysctl value (even when it is set at boot)?!?

How annoying...

The cgroup's swappiness value defaults to 60. This seems wrong to me from
a libvirt/systemd POV. If the system is booted with swappiness=0, then why
does the (user|machine|system).slice cgroup ignore this value when it
creates its cgroups (see below)? I will have to dig a little further to
find a cause/fix for this.
This requires libvirt users to understand a number of intricacies that
they really shouldn't have to consider, and may lead to headaches like
these ;P

Values of the memcgs created on boot (with sysctl vm.swappiness=0 on V1 boot)
------------------------------------------------------------------------
/sys/fs/cgroup/memory/memory.swappiness =0
/sys/fs/cgroup/memory/dev-hugepages.mount/memory.swappiness =60
/sys/fs/cgroup/memory/dev-mqueue.mount/memory.swappiness =60
/sys/fs/cgroup/memory/machine.slice/memory.swappiness =60
/sys/fs/cgroup/memory/proc-sys-fs-binfmt_misc.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-fs-fuse-connections.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-kernel-config.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-kernel-debug.mount/memory.swappiness =60
/sys/fs/cgroup/memory/sys-kernel-tracing.mount/memory.swappiness =60
/sys/fs/cgroup/memory/system.slice/memory.swappiness =60
/sys/fs/cgroup/memory/user.slice/memory.swappiness =60

Some seem to inherit the cgroup/memory/memory.swappiness value and some
do not... This was brought up in a systemd issue with no solution or
documentation [1].

Libvirt in particular uses the machine.slice cgroup, so it inherits the
60. If I change that value to 0 and then start the machine, it now has
swappiness 0.

$ echo 0 > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
$ virsh start <guest-name>
$ cat /sys/fs/cgroup/memory/machine.slice/machine-qemu.scope/memory.swappiness
0

Thank you so much for your very insightful note that led to the real
issue :)

> Thanks for verifying. I'll prepare a proper patch.

My issue with v1 vs v2 seems to go away with a much more sane value of
swappiness=10 on v1 (when actually set properly lol).

Also, as per my results below, I actually don't think your patch caused
much change to my workload. I'm not sure what happened the first time I
ran it that caused the swapping on v2 (before your patch)...
perhaps I ran the older kernel (~v5.14) that was still having issues with
v2, or it's the fact that the results can differ between runs. Sorry
about that.

Here are the test results for your patch with V1 and V2 (swappiness=0/10):

Before Patch
-------------
-- V1(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257465704     1100160        4224     5505568     5064964
Swap:       4194300       47828     4146472

Node 0 AnonPages:     128068580 kB
Node 1 AnonPages:     128120400 kB
Node 0 AnonHugePages: 128012288 kB
Node 1 AnonHugePages: 128090112 kB
                      ^^^^^ no loss

-- V1(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257364436      972048        3972     5734948     5164520
Swap:       4194300      235028     3959272

Node 0 AnonPages:     128015404 kB
Node 1 AnonPages:     128002788 kB
Node 0 AnonHugePages: 128002048 kB
Node 1 AnonHugePages: 120576000 kB
                      ^^^^^ some loss

-- V2(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257609352      924692        4664     5537388     4921236
Swap:       4194300           0     4194300
                          ^^^^^ No Swap

Node 0 AnonPages:     128083104 kB
Node 1 AnonPages:     128180576 kB
Node 0 AnonHugePages: 128002048 kB
Node 1 AnonHugePages: 128124928 kB
                      ^^^^^ No loss

-- V2(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257407576      918124        4632     5745732     5101764
Swap:       4194300      220424     3973876
                          ^^^^^ Some Swap

Node 0 AnonPages:     128109700 kB
Node 1 AnonPages:     127918164 kB
Node 0 AnonHugePages: 128006144 kB
Node 1 AnonHugePages: 120569856 kB
                      ^^^^^ some loss

After Patch
-------------
-- V1(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257538832      945276        4368     5587324     4991852
Swap:       4194300        9276     4185024

Node 0 AnonPages:     128133932 kB
Node 1 AnonPages:     128100540 kB
Node 0 AnonHugePages: 128047104 kB
Node 1 AnonHugePages: 128061440 kB

-- V1(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257428564      969252        4384     5673616     5100824
Swap:       4194300      138936     4055364
                          ^^^^^ Some Swap

Node 0 AnonPages:     128161724 kB
Node 1 AnonPages:     127945368 kB
Node 0 AnonHugePages: 128043008 kB
Node 1 AnonHugePages: 120221696 kB
                      ^^^^^ some loss

-- V2(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257536896      927424        4664     5607112     4993184
Swap:       4194300           0     4194300

Node 0 AnonPages:     128145476 kB
Node 1 AnonPages:     128111908 kB
Node 0 AnonHugePages: 128026624 kB
Node 1 AnonHugePages: 128090112 kB

-- V2(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257423936     1007076        4548     5640420     5106544
Swap:       4194300      156016     4038284

Node 0 AnonPages:     128133264 kB
Node 1 AnonPages:     127955952 kB
Node 0 AnonHugePages: 128018432 kB
Node 1 AnonHugePages: 122507264 kB
                      ^^^^ slightly better

The only notable difference between before/after your patch is that with
your patch the THP tearing was slightly better, leaving roughly 2GB more
THP-backed memory on node 1 in the last result (V2, swappiness=10). This
may just be noise.

I'll have to see if I can find a fix for this in either the kernel,
libvirt, or systemd, and will follow up if I do. If not, this should at
least be documented correctly. Given that cgroup V1 is in limited support
mode upstream, and given systemd's hesitancy to make changes for V1, we
may have to go down our own avenues to ensure our customers don't run
into this issue.

Big Thanks!
-- Nico

[1] - https://github.com/systemd/systemd/issues/9276
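
For anyone wanting to check their own V1 system, here is a rough sketch of
how the per-cgroup listing above could be collected, and how the cgroups
that did not pick up the root value could be singled out. The function names
dump_swappiness and mismatched_swappiness are made up for illustration, and
the default root of /sys/fs/cgroup/memory is an assumption taken from the
paths in the listing. Unlike the listing, this walks the whole hierarchy
rather than only the top level.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers, not existing tools.

# Print every memory.swappiness file under a cgroup v1 memory hierarchy
# in the same "path =value" form used in the listing above.
dump_swappiness() {
    root="${1:-/sys/fs/cgroup/memory}"   # assumed default mount point
    [ -d "$root" ] || { echo "no such directory: $root" >&2; return 1; }
    find "$root" -type f -name memory.swappiness | sort |
    while IFS= read -r f; do
        printf '%s =%s\n' "$f" "$(cat "$f")"
    done
}

# Print only the cgroups whose value differs from the root cgroup's value.
mismatched_swappiness() {
    root="${1:-/sys/fs/cgroup/memory}"
    want="$(cat "$root/memory.swappiness")" || return 1
    dump_swappiness "$root" | while IFS= read -r line; do
        case "$line" in
            *" =$want") ;;                 # same as root: skip
            *) printf '%s\n' "$line" ;;    # differs from root: report
        esac
    done
}
```

Calling mismatched_swappiness with no argument on the V1 setup described
above should report machine.slice, system.slice and user.slice (among
others) at 60, since the root cgroup is at 0.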