On 4/20/22 10:01, Johannes Weiner wrote:
>> My swappiness=0 solution was a minimal approach to regaining the 'avoid swapping
>> ANON' behavior that was previously there, but as Shakeel pointed out, there may
>> be something larger at play.
>
> So with my patch and swappiness=0 you get excessive swapping on v1 but
> not on v2? And the patch to avoid DEACTIVATE_ANON fixes it?

Correct. I haven't tested the DEACTIVATE_ANON patch since the last time I
was working on this, but it did cure it. I can build a new kernel with it
and verify again.

The larger issue is that our workload has regressed in performance.

With V2 and swappiness=10 we are still seeing some swap, but very little
tearing down of THPs over time. With swappiness=0 it did do some swapping,
but we are not losing GBs of THPs (with your patch, swappiness=0 shows no
swap or THP issues on V2).

With V1 and swappiness=(0|10) (with and without your patch), it swaps a ton
and ultimately leads to a significant amount of THP splitting. So the longer
the system/workload runs, the less likely we are to get THPs backing the
guest, and the performance gain from THPs is lost.

So your patch does help return the old swappiness=0 behavior, but only for
V2. Ideally we would like to keep swappiness > 0. With my patch and
swappiness=0 we could create a workaround for this effect on V1, but any
other value still results in the THP issue.

After the workload is run with V2 and swappiness=0, the host system looks
like this**:

              total        used        free      shared  buff/cache   available
Mem:      264071432   257536896      927424        4664     5607112     4993184
Swap:       4194300           0     4194300

Node 0 AnonPages:      128145476 kB
Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB
Node 1 AnonHugePages:  128090112 kB

** Without your patch there is still some swap and THP splitting, but
   nothing like the case below.

The same workload on V1/swappiness=0 looks like this:

              total        used        free      shared  buff/cache   available
Mem:      264071432   257169500     1032612        4192     5869320     5357944
Swap:       4194300      623008     3571292

Node 0 AnonPages:      127927156 kB
Node 1 AnonPages:      127701088 kB
Node 0 AnonHugePages:  127789056 kB
Node 1 AnonHugePages:   87552000 kB
                        ^^^^^^^^
This leads to the performance regression I'm referring to in later
workloads.

V2 used to have a similar effect to V1, but not nearly as bad; recent
upstream updates fixed this in V2. The workload tests multiple FS types,
so this is most likely not an FS-specific issue either.

> If you haven't done so, it could be useful to litter shrink_node() and
> get_scan_count() with trace_printk() to try to make sense of all the
> decisions that result in it swapping.

Will do :) Roughly what I have in mind is sketched below my sig. I was
originally doing some BPF tracing that led me to find the DEACTIVATE_ANON
case.

Thanks,
-- Nico
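
For reference, this is the kind of instrumentation I'm planning for
get_scan_count() (just a hand-written sketch against a recent-ish
mm/vmscan.c, so exact variable names and placement may differ on other
trees; "swappiness" and "scan_balance" are the existing locals in that
function):

	static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
				   unsigned long *nr)
	{
		...
		/* once scan_balance has been decided, before nr[] is filled in */
		trace_printk("swappiness=%d priority=%d scan_balance=%d "
			     "may_deactivate=%d anon_cost=%lu file_cost=%lu\n",
			     swappiness, sc->priority, scan_balance,
			     sc->may_deactivate, sc->anon_cost, sc->file_cost);
		...
	}

A similar trace_printk() in shrink_node(), as you suggest, should then show
how those per-lruvec decisions translate into the anon reclaim we're seeing
despite swappiness=0.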