On 21/10/2024 06:09, Barry Song wrote:
> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>> After large folio zswapout support added in [1], this patch adds
>> support for zswapin of large folios to bring it on par with zram.
>> This series makes sure that the benefits of large folios (fewer
>> page faults, batched PTE and rmap manipulation, reduced lru list,
>> TLB coalescing (for arm64 and amd)) are not lost at swap out when
>> using zswap.
>>
>> It builds on top of [2] which added large folio swapin support for
>> zram and provides the same level of large folio swapin support as
>> zram, i.e. only supporting swap count == 1.
>>
>> Patch 1 skips swapcache for swapping in zswap pages, this should improve
>> no readahead swapin performance [3], and also allows us to build on large
>> folio swapin support added in [2], hence is a prerequisite for patch 3.
>>
>> Patch 3 adds support for large folio zswapin. This patch does not add
>> support for hybrid backends (i.e. folios partly present swap and zswap).
>>
>> The main performance benefit comes from maintaining large folios *after*
>> swapin, large folio performance improvements have been mentioned in previous
>> series posted on it [2],[4], so have not added those. Below is a simple
>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along
>> with memory integrity check).
>>
>>                                | no mTHP (ms) | 1M mTHP enabled (ms)
>> Base kernel                    |     1165     |        1163
>> Kernel with mTHP zswpin series |     1203     |         738
>
> Hi Usama,
> Do you know where this minor regression for non-mTHP comes from?
> As you even have skipped swapcache for small folios in zswap in patch1,
> that part should have some gain? is it because of zswap_present_test()?
>

Hi Barry,

The microbenchmark does a sequential read of 1G of memory, so it probably
isn't very representative of real-world use cases. This also means that
swap_vma_readahead is able to accurately read ahead all pages in its window.
With this patch series, if doing 4K swapin, you get 1G/4K calls of fast
do_swap_page. Without this patch, you get 1G/(4K * readahead window) slow
do_swap_page calls. I had added some prints and I was seeing 8 pages being
read ahead in 1 do_swap_page. The larger number of calls causes the slight
regression (even though they are quite fast). I think in a realistic
scenario, where the readahead window won't be as large, there won't be a
regression.

The cost of zswap_present_test in the whole call stack of swapping in a page
is very low and I think it can be ignored.

I think the more interesting thing is what Kanchana pointed out in
https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@xxxxxxxxx/
I am curious, did you see this when testing large folio swapin and
compression at 4K granularity? It looks like swap thrashing, so I think it
would be common between zswap and zram. I don't have larger granularity
zswap compression done, which is why I think there is a regression in time
taken. (It could be because it's tested on Intel as well.)

Thanks,
Usama

>>
>> The time measured was pretty consistent between runs (~1-2% variation).
>> There is 36% improvement in zswapin time with 1M folios. The percentage
>> improvement is likely to be more if the memcmp is removed.
>>
>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
>> index 40de679248b8..77068c577c86 100644
>> --- a/tools/testing/selftests/cgroup/test_zswap.c
>> +++ b/tools/testing/selftests/cgroup/test_zswap.c
>> @@ -9,6 +9,8 @@
>>  #include <string.h>
>>  #include <sys/wait.h>
>>  #include <sys/mman.h>
>> +#include <sys/time.h>
>> +#include <malloc.h>
>>
>>  #include "../kselftest.h"
>>  #include "cgroup_util.h"
>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root)
>>  	return test_zswap_writeback(root, false);
>>  }
>>
>> +static int zswapin_perf(const char *cgroup, void *arg)
>> +{
>> +	long pagesize = sysconf(_SC_PAGESIZE);
>> +	size_t memsize = MB(1*1024);
>> +	char buf[pagesize];
>> +	int ret = -1;
>> +	char *mem;
>> +	struct timeval start, end;
>> +
>> +	mem = (char *)memalign(2*1024*1024, memsize);
>> +	if (!mem)
>> +		return ret;
>> +
>> +	/*
>> +	 * Fill half of each page with increasing data, and keep other
>> +	 * half empty, this will result in data that is still compressible
>> +	 * and ends up in zswap, with material zswap usage.
>> +	 */
>> +	for (int i = 0; i < pagesize; i++)
>> +		buf[i] = i < pagesize/2 ? (char) i : 0;
>> +
>> +	for (int i = 0; i < memsize; i += pagesize)
>> +		memcpy(&mem[i], buf, pagesize);
>> +
>> +	/* Try and reclaim allocated memory */
>> +	if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
>> +		ksft_print_msg("Failed to reclaim all of the requested memory\n");
>> +		goto out;
>> +	}
>> +
>> +	gettimeofday(&start, NULL);
>> +	/* zswpin */
>> +	for (int i = 0; i < memsize; i += pagesize) {
>> +		if (memcmp(&mem[i], buf, pagesize)) {
>> +			ksft_print_msg("invalid memory\n");
>> +			goto out;
>> +		}
>> +	}
>> +	gettimeofday(&end, NULL);
>> +	printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000);
>> +	ret = 0;
>> +out:
>> +	free(mem);
>> +	return ret;
>> +}
>> +
>> +static int test_zswapin_perf(const char *root)
>> +{
>> +	int ret = KSFT_FAIL;
>> +	char *test_group;
>> +
>> +	test_group = cg_name(root, "zswapin_perf_test");
>> +	if (!test_group)
>> +		goto out;
>> +	if (cg_create(test_group))
>> +		goto out;
>> +
>> +	if (cg_run(test_group, zswapin_perf, NULL))
>> +		goto out;
>> +
>> +	ret = KSFT_PASS;
>> +out:
>> +	cg_destroy(test_group);
>> +	free(test_group);
>> +	return ret;
>> +}
>> +
>>  /*
>>   * When trying to store a memcg page in zswap, if the memcg hits its memory
>>   * limit in zswap, writeback should affect only the zswapped pages of that
>> @@ -584,6 +654,7 @@ struct zswap_test {
>>  	T(test_zswapin),
>>  	T(test_zswap_writeback_enabled),
>>  	T(test_zswap_writeback_disabled),
>> +	T(test_zswapin_perf),
>>  	T(test_no_kmem_bypass),
>>  	T(test_no_invasive_cgroup_shrink),
>>  };
>>
>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@xxxxxxxxx/
>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@xxxxxxxx/
>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@xxxxxxxxxx/T/#u
>> [4] https://lwn.net/Articles/955575/
>>
>> Usama Arif (4):
>>   mm/zswap: skip swapcache for swapping in zswap pages
>>   mm/zswap: modify zswap_decompress to accept page instead of folio
>>   mm/zswap: add support for large folio zswapin
>>   mm/zswap: count successful large folio zswap loads
>>
>>  Documentation/admin-guide/mm/transhuge.rst |   3 +
>>  include/linux/huge_mm.h                    |   1 +
>>  include/linux/zswap.h                      |   6 ++
>>  mm/huge_memory.c                           |   3 +
>>  mm/memory.c                                |  16 +--
>>  mm/page_io.c                               |   2 +-
>>  mm/zswap.c                                 | 120 ++++++++++++++-------
>>  7 files changed, 99 insertions(+), 52 deletions(-)
>>
>> --
>> 2.43.5
>>
>
> Thanks
> barry