On 4/20/22 10:01, Johannes Weiner wrote:
>> My swappiness=0 solution was a minimal approach to regaining the 'avoid swapping
>> ANON' behavior that was previously there, but as Shakeel pointed out, there may
>> be something larger at play.
>
> So with my patch and swappiness=0 you get excessive swapping on v1 but
> not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?

Correct. I haven't tested the DEACTIVATE_ANON patch since the last time I
was working on this, but it did cure it. I can build a new kernel with it
and verify again.

The larger issue is that our workload has regressed in performance.

With V2 and swappiness=10 we are still seeing some swap, but very little
tearing down of THPs over time. With swappiness=0 it did do some swapping,
but we are not losing GBs of THPs (with your patch, swappiness=0 shows no
swap or THP issues on V2).

With V1 and swappiness=(0|10) (with and without your patch), it swaps a ton
and ultimately leads to a significant amount of THP splitting. So the longer
the system/workload runs, the less likely we are to get THPs backing the
guest, and the performance gain from THPs is lost.

So your patch does help return the old swappiness=0 behavior, but only for
V2. Ideally we would like to keep swappiness > 0. With my patch and
swappiness=0 we could create a workaround for this effect on V1, but any
other value still results in the THP issue.

After the workload is run with V2 and swappiness=0, the host system looks
like this**:

              total        used        free      shared  buff/cache   available
Mem:      264071432   257536896      927424        4664     5607112     4993184
Swap:       4194300           0     4194300

Node 0 AnonPages:      128145476 kB
Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB
Node 1 AnonHugePages:  128090112 kB

** Without your patch there is still some swap and THP splitting, but
   nothing like the case below.

The same workload on V1/swappiness=0 looks like this:

              total        used        free      shared  buff/cache   available
Mem:      264071432   257169500     1032612        4192     5869320     5357944
Swap:       4194300      623008     3571292

Node 0 AnonPages:      127927156 kB
Node 1 AnonPages:      127701088 kB
Node 0 AnonHugePages:  127789056 kB
Node 1 AnonHugePages:   87552000 kB
                        ^^^^^^^^
This leads to the performance regression I'm referring to in later
workloads.

V2 used to have a similar effect to V1, but not nearly as bad; recent
upstream updates fixed this in V2. The workload tests multiple FS types,
so this is most likely not an FS-specific issue either.

> If you haven't done so, it could be useful to litter shrink_node() and
> get_scan_count() with trace_printk() to try to make sense of all the
> decisions that result in it swapping.

Will do :) Roughly what I have in mind is sketched below my sig. I was
originally doing some BPF tracing that led me to find the DEACTIVATE_ANON
case.

Thanks,
-- Nico
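
For reference, this is the kind of instrumentation I'm planning for
get_scan_count() (just a hand-written sketch against a recent-ish
mm/vmscan.c, so exact variable names and placement may differ on other
trees; "swappiness" and "scan_balance" are the existing locals in that
function):

	static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
				   unsigned long *nr)
	{
		...
		/* once scan_balance has been decided, before nr[] is filled in */
		trace_printk("swappiness=%d priority=%d scan_balance=%d "
			     "may_deactivate=%d anon_cost=%lu file_cost=%lu\n",
			     swappiness, sc->priority, scan_balance,
			     sc->may_deactivate, sc->anon_cost, sc->file_cost);
		...
	}

A similar trace_printk() in shrink_node(), as you suggest, should then show
how those per-lruvec decisions translate into the anon reclaim we're seeing
despite swappiness=0.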