Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios

Barry Song <21cnbao@xxxxxxxxx> · Wed, 23 Oct 2024 09:46:50 +1300

On Wed, Oct 23, 2024 at 4:26 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/10/2024 11:40, Usama Arif wrote:
> >
> >
> > On 21/10/2024 06:09, Barry Song wrote:
> >> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>
> >>> After large folio zswapout support added in [1], this patch adds
> >>> support for zswapin of large folios to bring it on par with zram.
> >>> This series makes sure that the benefits of large folios (fewer
> >>> page faults, batched PTE and rmap manipulation, reduced lru list,
> >>> TLB coalescing (for arm64 and amd)) are not lost at swap out when
> >>> using zswap.
> >>>
> >>> It builds on top of [2] which added large folio swapin support for
> >>> zram and provides the same level of large folio swapin support as
> >>> zram, i.e. only supporting swap count == 1.
> >>>
> >>> Patch 1 skips swapcache for swapping in zswap pages, this should improve
> >>> no readahead swapin performance [3], and also allows us to build on large
> >>> folio swapin support added in [2], hence is a prerequisite for patch 3.
> >>>
> >>> Patch 3 adds support for large folio zswapin. This patch does not add
> >>> support for hybrid backends (i.e. folios partly present swap and zswap).
> >>>
> >>> The main performance benefit comes from maintaining large folios *after*
> >>> swapin, large folio performance improvements have been mentioned in previous
> >>> series posted on it [2],[4], so have not added those. Below is a simple
> >>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along
> >>> with memory integrity check).
> >>>
> >>>                                 |  no mTHP (ms) | 1M mTHP enabled (ms)
> >>> Base kernel                     |   1165        |    1163
> >>> Kernel with mTHP zswpin series  |   1203        |     738
> >>
> >> Hi Usama,
> >> Do you know where this minor regression for non-mTHP comes from?
> >> As you even have skipped swapcache for small folios in zswap in patch1,
> >> that part should have some gain? is it because of zswap_present_test()?
> >>
> >
> > Hi Barry,
> >
> > The microbenchmark does a sequential read of 1G of memory, so it probably
> > isnt very representative of real world usecases. This also means that
> > swap_vma_readahead is able to readahead accurately all pages in its window.
> > With this patch series, if doing 4K swapin, you get 1G/4K calls of fast
> > do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow
> > do_swap_page calls. I had added some prints and I was seeing 8 pages being
> > readahead in 1 do_swap_page. The larger number of calls causes the slight
> > regression (eventhough they are quite fast). I think in a realistic scenario,
> > where readahead window wont be as large, there wont be a regression.
> > The cost of zswap_present_test in the whole call stack of swapping page is
> > very low and I think can be ignored.
> >
> > I think the more interesting thing is what Kanchana pointed out in
> > https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@xxxxxxxxx/
> > I am curious, did you see this when testing large folio swapin and compression
> > at 4K granuality? Its looks like swap thrashing so I think it would be common
> > between zswap and zram. I dont have larger granuality zswap compression done,
> > which is why I think there is a regression in time taken. (It could be because
> > its tested on intel as well).
> >
> > Thanks,
> > Usama
> >
>
> Hi,
>
> So I have been doing some benchmarking after Kanchana pointed out a performance
> regression in [1] of swapping in large folio. I would love to get thoughts from
> zram folks on this, as thats where large folio swapin was first added [2].
> As far as I can see, the current support in zram is doing large folio swapin
> at 4K granuality. The large granuality compression in [3] which was posted
> in March is not merged, so I am currently comparing upstream zram with this series.
>
> With the microbenchmark below of timing 1G swapin, there was a very large improvement
> in performance by using this series. I think similar numbers would be seen in zram.

Imagine running several apps on a phone and switching
between them: A → B → C → D → E … → A → B … The app
currently on the screen retains its memory, while the ones
sent to the background are swapped out. When we bring
those apps back to the foreground, their memory is restored.
This behavior is quite similar to what you're seeing with
your microbenchmark.

>
> But when doing kernel build test, Kanchana saw a regression in [1]. I believe
> its because of swap thrashing (causing large zswap activity), due to larger page swapin.
> The part of the code that decides large folio swapin is the same between zswap and zram,
> so I believe this would be observed in zram as well.

Is this an extreme case where the workload's working set far
exceeds the available memory by memcg limitation? I doubt mTHP
would provide any real benefit from the start if the workload is bound to
experience swap thrashing. What if we disable mTHP entirely?

>
> My initial thought was this might be because its intel, where you dont have the advantage
> of TLB coalescing, so tested on AMD and ARM, but the regression is there on AMD
> and ARM as well, though a bit less (have added the numbers below).
>
> The numbers show that the zswap activity increases and page faults decrease.
> Overall this does result in sys time increasing and real time slightly increases,
> likely because the cost of increased zswap activity is more than the benefit of
> lower page faults.
> I can see in [3] that pagefaults reduced in zram as well.
>
> Large folio swapin shows good numbers in microbenchmarks that just target reduce page
> faults and sequential swapin only, but not in kernel build test. Is a similar regression
> observed with zram when enabling large folio swapin on kernel build test? Maybe large
> folio swapin makes more sense on workloads where mappings are kept for a longer time?
>

I suspect this is because mTHP doesn't always benefit workloads
when available memory is quite limited compared to the working set.
In that case, mTHP swap-in might introduce more features that
exacerbate the problem. We used to have an extra control "swapin_enabled"
for swap-in, but it never gained much traction:
https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@xxxxxxxxx/
We can reconsider whether to include the knob, but if it's better
to disable mTHP entirely for these cases, we can still adhere to
the policy of "enabled".

Using large block compression and decompression in zRAM will
significantly reduce CPU usage, likely making the issue unnoticeable.
However, the default minimum size for large block support is currently
set to 64KB(ZSMALLOC_MULTI_PAGES_ORDER = 4).

>
> Kernel build numbers in cgroup with memory.max=4G to trigger zswap
> Command for AMD: make defconfig; time make -j$(nproc) bzImage
> Command for ARM: make defconfig; time make -j$(nproc) Image
>
>
> AMD 16K+32K THP=always
> metric         mm-unstable      mm-unstable + large folio zswapin series
> real           1m23.038s        1m23.050s
> user           53m57.210s       53m53.437s
> sys            7m24.592s        7m48.843s
> zswpin         612070           999244
> zswpout        2226403          2347979
> pgfault        20667366         20481728
> pgmajfault     385887           269117
>
> AMD 16K+32K+64K THP=always
> metric         mm-unstable      mm-unstable + large folio zswapin series
> real           1m22.975s        1m23.266s
> user           53m51.302s       53m51.069s
> sys            7m40.168s        7m57.104s
> zswpin         676492           1258573
> zswpout        2449839          2714767
> pgfault        17540746         17296555
> pgmajfault     429629           307495
> --------------------------
> ARM 16K+32K THP=always
> metric         mm-unstable      mm-unstable + large folio zswapin series
> real           0m51.168s        0m52.086s
> user           25m14.715s       25m15.765s
> sys            17m18.856s       18m8.031s
> zswpin         3904129          7339245
> zswpout        11171295         13473461
> pgfault        37313345         36011338
> pgmajfault     2726253          1932642
>
>
> ARM 16K+32K+64K THP=always
> metric         mm-unstable      mm-unstable + large folio zswapin series
> real           0m52.017s        0m53.828s
> user           25m2.742s        25m0.046s
> sys            18m24.525s       20m26.207s
> zswpin         4853571          8908664
> zswpout        12297199         15768764
> pgfault        32158152         30425519
> pgmajfault     3320717          2237015
>
>
> Thanks!
> Usama
>
>
> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@xxxxxxxxx/
> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@xxxxxxxx/
> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@xxxxxxxxx/
>
> >
> >>>
> >>> The time measured was pretty consistent between runs (~1-2% variation).
> >>> There is 36% improvement in zswapin time with 1M folios. The percentage
> >>> improvement is likely to be more if the memcmp is removed.
> >>>
> >>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
> >>> index 40de679248b8..77068c577c86 100644
> >>> --- a/tools/testing/selftests/cgroup/test_zswap.c
> >>> +++ b/tools/testing/selftests/cgroup/test_zswap.c
> >>> @@ -9,6 +9,8 @@
> >>>  #include <string.h>
> >>>  #include <sys/wait.h>
> >>>  #include <sys/mman.h>
> >>> +#include <sys/time.h>
> >>> +#include <malloc.h>
> >>>
> >>>  #include "../kselftest.h"
> >>>  #include "cgroup_util.h"
> >>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root)
> >>>         return test_zswap_writeback(root, false);
> >>>  }
> >>>
> >>> +static int zswapin_perf(const char *cgroup, void *arg)
> >>> +{
> >>> +       long pagesize = sysconf(_SC_PAGESIZE);
> >>> +       size_t memsize = MB(1*1024);
> >>> +       char buf[pagesize];
> >>> +       int ret = -1;
> >>> +       char *mem;
> >>> +       struct timeval start, end;
> >>> +
> >>> +       mem = (char *)memalign(2*1024*1024, memsize);
> >>> +       if (!mem)
> >>> +               return ret;
> >>> +
> >>> +       /*
> >>> +        * Fill half of each page with increasing data, and keep other
> >>> +        * half empty, this will result in data that is still compressible
> >>> +        * and ends up in zswap, with material zswap usage.
> >>> +        */
> >>> +       for (int i = 0; i < pagesize; i++)
> >>> +               buf[i] = i < pagesize/2 ? (char) i : 0;
> >>> +
> >>> +       for (int i = 0; i < memsize; i += pagesize)
> >>> +               memcpy(&mem[i], buf, pagesize);
> >>> +
> >>> +       /* Try and reclaim allocated memory */
> >>> +       if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
> >>> +               ksft_print_msg("Failed to reclaim all of the requested memory\n");
> >>> +               goto out;
> >>> +       }
> >>> +
> >>> +       gettimeofday(&start, NULL);
> >>> +       /* zswpin */
> >>> +       for (int i = 0; i < memsize; i += pagesize) {
> >>> +               if (memcmp(&mem[i], buf, pagesize)) {
> >>> +                       ksft_print_msg("invalid memory\n");
> >>> +                       goto out;
> >>> +               }
> >>> +       }
> >>> +       gettimeofday(&end, NULL);
> >>> +       printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000);
> >>> +       ret = 0;
> >>> +out:
> >>> +       free(mem);
> >>> +       return ret;
> >>> +}
> >>> +
> >>> +static int test_zswapin_perf(const char *root)
> >>> +{
> >>> +       int ret = KSFT_FAIL;
> >>> +       char *test_group;
> >>> +
> >>> +       test_group = cg_name(root, "zswapin_perf_test");
> >>> +       if (!test_group)
> >>> +               goto out;
> >>> +       if (cg_create(test_group))
> >>> +               goto out;
> >>> +
> >>> +       if (cg_run(test_group, zswapin_perf, NULL))
> >>> +               goto out;
> >>> +
> >>> +       ret = KSFT_PASS;
> >>> +out:
> >>> +       cg_destroy(test_group);
> >>> +       free(test_group);
> >>> +       return ret;
> >>> +}
> >>> +
> >>>  /*
> >>>   * When trying to store a memcg page in zswap, if the memcg hits its memory
> >>>   * limit in zswap, writeback should affect only the zswapped pages of that
> >>> @@ -584,6 +654,7 @@ struct zswap_test {
> >>>         T(test_zswapin),
> >>>         T(test_zswap_writeback_enabled),
> >>>         T(test_zswap_writeback_disabled),
> >>> +       T(test_zswapin_perf),
> >>>         T(test_no_kmem_bypass),
> >>>         T(test_no_invasive_cgroup_shrink),
> >>>  };
> >>>
> >>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@xxxxxxxxx/
> >>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@xxxxxxxx/
> >>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@xxxxxxxxxx/T/#u
> >>> [4] https://lwn.net/Articles/955575/
> >>>
> >>> Usama Arif (4):
> >>>   mm/zswap: skip swapcache for swapping in zswap pages
> >>>   mm/zswap: modify zswap_decompress to accept page instead of folio
> >>>   mm/zswap: add support for large folio zswapin
> >>>   mm/zswap: count successful large folio zswap loads
> >>>
> >>>  Documentation/admin-guide/mm/transhuge.rst |   3 +
> >>>  include/linux/huge_mm.h                    |   1 +
> >>>  include/linux/zswap.h                      |   6 ++
> >>>  mm/huge_memory.c                           |   3 +
> >>>  mm/memory.c                                |  16 +--
> >>>  mm/page_io.c                               |   2 +-
> >>>  mm/zswap.c                                 | 120 ++++++++++++++-------
> >>>  7 files changed, 99 insertions(+), 52 deletions(-)
> >>>
> >>> --
> >>> 2.43.5
> >>>
> >>

Thanks
Barry