On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>
> [...]
>
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
>
> I have used swap sub test case of vm-scalbility to test.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

I ended up using `usemem`, which is the core of this test suite, but deviated
from the pre-canned test case to allow me to use anonymous memory and get
numbers for small-sized THP (this is a very useful tool - thanks for pointing it
out!)

I've run the tests on Ampere Altra, set up with a 35G block ram device as the
swap device and from inside a memcg limited to 40G memory. I've then run
`usemem` with 70 processes (each has its own core), each allocating and writing
1G of memory. I've repeated everything 5 times and taken the mean and stdev:

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                0.0% |                2.3% |                9.1% |
| 64K THP    |              -44.1% |              -46.3% |               30.6% |
| 2M THP     |               56.0% |               54.2% |               60.1% |

Standard Deviation as Percentage of Mean

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                3.4% |                7.1% |                1.7% |
| 64K THP    |                1.9% |                5.6% |                7.7% |
| 2M THP     |                1.9% |                2.1% |                3.2% |

I don't see any meaningful performance cost to removing the HUGE flag, so
hopefully this gives us confidence to move forward with patch 1.

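For reference, the two statistics in the tables above can be derived from the
raw per-run numbers roughly as follows. This is only a sketch: it assumes the
metric is a higher-is-better throughput and that the sample standard deviation
is used. The function names and the input values are hypothetical placeholders,
not measured data.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, stdev as a percentage of the mean) for per-run results.

    Assumption: `runs` holds one higher-is-better throughput value per
    repetition, and the sample (n - 1) standard deviation is wanted.
    """
    m = mean(runs)
    return m, 100.0 * stdev(runs) / m


def improvement_pct(config_mean, reference_mean):
    """Percentage improvement of a configuration over the reference mean."""
    return 100.0 * (config_mean - reference_mean) / reference_mean


# Hypothetical placeholder inputs (5 repetitions each), NOT measured results.
reference_runs = [100.0, 102.0, 98.0, 101.0, 99.0]   # stands in for 4K/baseline
candidate_runs = [130.0, 131.0, 129.0, 132.0, 128.0]  # stands in for another cell

ref_mean, ref_sd_pct = summarize(reference_runs)
cand_mean, cand_sd_pct = summarize(candidate_runs)
cand_improvement = improvement_pct(cand_mean, ref_mean)

print(f"reference: stdev {ref_sd_pct:.1f}% of mean")
print(f"candidate: {cand_improvement:+.1f}% vs reference, "
      f"stdev {cand_sd_pct:.1f}% of mean")
```

Every cell in the first table would be computed against the same reference mean
(4K pages on the baseline kernel), which is why that cell reads 0.0%.
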
You can indeed see the performance regression in the baseline when THP is
configured to allocate small-sized THP only (in this case 64K). And you can see
the regression is fixed by patch 2, which avoids splitting the THP and thus
avoids the extra TLBIs. This correlates with what I saw in the kernel
compilation workload.

Huang Ying, based on these results, do you still want me to pursue a per-cpu
solution to avoid potential contention on the swap info lock? If so, I proposed
in the thread against patch 2 to do this in the swap_slots layer rather than in
swapfile.c directly (I'm not sure how your original proposal would actually
work?). But based on these results, it's not obvious to me that there is a
definite problem here, and it might be simpler to avoid the complexity?

Thanks,
Ryan

>
> --
> Best Regards,
> Huang, Ying