On Wed, Oct 23, 2024 at 10:17 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > > > > On 22/10/2024 21:46, Barry Song wrote: > > On Wed, Oct 23, 2024 at 4:26 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >> > >> > >> > >> On 21/10/2024 11:40, Usama Arif wrote: > >>> > >>> > >>> On 21/10/2024 06:09, Barry Song wrote: > >>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >>>>> > >>>>> After large folio zswapout support added in [1], this patch adds > >>>>> support for zswapin of large folios to bring it on par with zram. > >>>>> This series makes sure that the benefits of large folios (fewer > >>>>> page faults, batched PTE and rmap manipulation, reduced lru list, > >>>>> TLB coalescing (for arm64 and amd)) are not lost at swap out when > >>>>> using zswap. > >>>>> > >>>>> It builds on top of [2] which added large folio swapin support for > >>>>> zram and provides the same level of large folio swapin support as > >>>>> zram, i.e. only supporting swap count == 1. > >>>>> > >>>>> Patch 1 skips swapcache for swapping in zswap pages, this should improve > >>>>> no readahead swapin performance [3], and also allows us to build on large > >>>>> folio swapin support added in [2], hence is a prerequisite for patch 3. > >>>>> > >>>>> Patch 3 adds support for large folio zswapin. This patch does not add > >>>>> support for hybrid backends (i.e. folios partly present swap and zswap). > >>>>> > >>>>> The main performance benefit comes from maintaining large folios *after* > >>>>> swapin, large folio performance improvements have been mentioned in previous > >>>>> series posted on it [2],[4], so have not added those. Below is a simple > >>>>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along > >>>>> with memory integrity check). > >>>>> > >>>>> | no mTHP (ms) | 1M mTHP enabled (ms) > >>>>> Base kernel | 1165 | 1163 > >>>>> Kernel with mTHP zswpin series | 1203 | 738 > >>>> > >>>> Hi Usama, > >>>> Do you know where this minor regression for non-mTHP comes from? > >>>> As you even have skipped swapcache for small folios in zswap in patch1, > >>>> that part should have some gain? is it because of zswap_present_test()? > >>>> > >>> > >>> Hi Barry, > >>> > >>> The microbenchmark does a sequential read of 1G of memory, so it probably > >>> isnt very representative of real world usecases. This also means that > >>> swap_vma_readahead is able to readahead accurately all pages in its window. > >>> With this patch series, if doing 4K swapin, you get 1G/4K calls of fast > >>> do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow > >>> do_swap_page calls. I had added some prints and I was seeing 8 pages being > >>> readahead in 1 do_swap_page. The larger number of calls causes the slight > >>> regression (eventhough they are quite fast). I think in a realistic scenario, > >>> where readahead window wont be as large, there wont be a regression. > >>> The cost of zswap_present_test in the whole call stack of swapping page is > >>> very low and I think can be ignored. > >>> > >>> I think the more interesting thing is what Kanchana pointed out in > >>> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@xxxxxxxxx/ > >>> I am curious, did you see this when testing large folio swapin and compression > >>> at 4K granuality? Its looks like swap thrashing so I think it would be common > >>> between zswap and zram. I dont have larger granuality zswap compression done, > >>> which is why I think there is a regression in time taken. (It could be because > >>> its tested on intel as well). > >>> > >>> Thanks, > >>> Usama > >>> > >> > >> Hi, > >> > >> So I have been doing some benchmarking after Kanchana pointed out a performance > >> regression in [1] of swapping in large folio. I would love to get thoughts from > >> zram folks on this, as thats where large folio swapin was first added [2]. > >> As far as I can see, the current support in zram is doing large folio swapin > >> at 4K granuality. The large granuality compression in [3] which was posted > >> in March is not merged, so I am currently comparing upstream zram with this series. > >> > >> With the microbenchmark below of timing 1G swapin, there was a very large improvement > >> in performance by using this series. I think similar numbers would be seen in zram. > > > > Imagine running several apps on a phone and switching > > between them: A → B → C → D → E … → A → B … The app > > currently on the screen retains its memory, while the ones > > sent to the background are swapped out. When we bring > > those apps back to the foreground, their memory is restored. > > This behavior is quite similar to what you're seeing with > > your microbenchmark. > > > > Hi Barry, > > Thanks for explaining this! Do you know if there is some open source benchmark > we could use to show an improvement in app switching with large folios? > I’m fairly certain the Android team has this benchmark, but it’s not open source. A straightforward way to simulate this is to use a script that cyclically launches multiple applications, such as Chrome, Firefox, Office, PDF, and others. for example: launch chrome; launch firefox; launch youtube; .... launch chrome; launch firefox; .... On Android, we have "Android activity manager 'am' command" to do that. https://gist.github.com/tsohr/5711945 Not quite sure if other windows managers have similar tools. > Also I guess swap thrashing can happen when apps are brought back to foreground? > Typically, the foreground app doesn't experience much swapping, as it is the most recently or frequently used. However, this may not hold for very low-end phones, where memory is significantly less than the app's working set. For instance, we can't expect a good user experience when playing a large game that requires 8GB of memory on a 4GB phone! :-) And for low-end phones, we never even enable mTHP. > >> > >> But when doing kernel build test, Kanchana saw a regression in [1]. I believe > >> its because of swap thrashing (causing large zswap activity), due to larger page swapin. > >> The part of the code that decides large folio swapin is the same between zswap and zram, > >> so I believe this would be observed in zram as well. > > > > Is this an extreme case where the workload's working set far > > exceeds the available memory by memcg limitation? I doubt mTHP > > would provide any real benefit from the start if the workload is bound to > > experience swap thrashing. What if we disable mTHP entirely? > > > > I would agree, this is an extreme case. I wanted (z)swap activity to happen so limited > memory.max to 4G. > > mTHP is beneficial in kernel test benchmarking going from no mTHP to 16K: > > ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G > metric no mTHP 16K mTHP=always > real 1m0.613s 0m52.008s > user 25m23.028s 25m19.488s > sys 25m45.466s 18m11.640s > zswpin 1911194 3108438 > zswpout 6880815 9374628 > pgfault 120430166 48976658 > pgmajfault 1580674 2327086 > > Interesting! We never use a phone to build the Linux kernel, but let me see if I can find some other machines to reproduce your data. > > > >> > >> My initial thought was this might be because its intel, where you dont have the advantage > >> of TLB coalescing, so tested on AMD and ARM, but the regression is there on AMD > >> and ARM as well, though a bit less (have added the numbers below). > >> > >> The numbers show that the zswap activity increases and page faults decrease. > >> Overall this does result in sys time increasing and real time slightly increases, > >> likely because the cost of increased zswap activity is more than the benefit of > >> lower page faults. > >> I can see in [3] that pagefaults reduced in zram as well. > >> > >> Large folio swapin shows good numbers in microbenchmarks that just target reduce page > >> faults and sequential swapin only, but not in kernel build test. Is a similar regression > >> observed with zram when enabling large folio swapin on kernel build test? Maybe large > >> folio swapin makes more sense on workloads where mappings are kept for a longer time? > >> > > > > I suspect this is because mTHP doesn't always benefit workloads > > when available memory is quite limited compared to the working set. > > In that case, mTHP swap-in might introduce more features that > > exacerbate the problem. We used to have an extra control "swapin_enabled" > > for swap-in, but it never gained much traction: > > https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@xxxxxxxxx/ > > We can reconsider whether to include the knob, but if it's better > > to disable mTHP entirely for these cases, we can still adhere to > > the policy of "enabled". > > > Yes I think this makes sense to have. The only thing is, its too many knobs! > I personally think its already difficult to decide upto which mTHP size we > should enable (and I think this changes per workload). But if we add swapin_enabled > on top of that it can make things more difficult. > > > Using large block compression and decompression in zRAM will > > significantly reduce CPU usage, likely making the issue unnoticeable. > > However, the default minimum size for large block support is currently > > set to 64KB(ZSMALLOC_MULTI_PAGES_ORDER = 4). > > > > I saw that the patch was sent in March, and there werent any updates after? > Maybe I can try and cherry-pick that and see if we can develop large > granularity compression for zswap. will provide an updated version next week. > > >> > >> Kernel build numbers in cgroup with memory.max=4G to trigger zswap > >> Command for AMD: make defconfig; time make -j$(nproc) bzImage > >> Command for ARM: make defconfig; time make -j$(nproc) Image > >> > >> > >> AMD 16K+32K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 1m23.038s 1m23.050s > >> user 53m57.210s 53m53.437s > >> sys 7m24.592s 7m48.843s > >> zswpin 612070 999244 > >> zswpout 2226403 2347979 > >> pgfault 20667366 20481728 > >> pgmajfault 385887 269117 > >> > >> AMD 16K+32K+64K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 1m22.975s 1m23.266s > >> user 53m51.302s 53m51.069s > >> sys 7m40.168s 7m57.104s > >> zswpin 676492 1258573 > >> zswpout 2449839 2714767 > >> pgfault 17540746 17296555 > >> pgmajfault 429629 307495 > >> -------------------------- > >> ARM 16K+32K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 0m51.168s 0m52.086s > >> user 25m14.715s 25m15.765s > >> sys 17m18.856s 18m8.031s > >> zswpin 3904129 7339245 > >> zswpout 11171295 13473461 > >> pgfault 37313345 36011338 > >> pgmajfault 2726253 1932642 > >> > >> > >> ARM 16K+32K+64K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 0m52.017s 0m53.828s > >> user 25m2.742s 25m0.046s > >> sys 18m24.525s 20m26.207s > >> zswpin 4853571 8908664 > >> zswpout 12297199 15768764 > >> pgfault 32158152 30425519 > >> pgmajfault 3320717 2237015 > >> > >> > >> Thanks! > >> Usama > >> > >> > >> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@xxxxxxxxx/ > >> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@xxxxxxxx/ > >> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@xxxxxxxxx/ > >> > >>> > >>>>> > >>>>> The time measured was pretty consistent between runs (~1-2% variation). > >>>>> There is 36% improvement in zswapin time with 1M folios. The percentage > >>>>> improvement is likely to be more if the memcmp is removed. > >>>>> > >>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c > >>>>> index 40de679248b8..77068c577c86 100644 > >>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c > >>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c > >>>>> @@ -9,6 +9,8 @@ > >>>>> #include <string.h> > >>>>> #include <sys/wait.h> > >>>>> #include <sys/mman.h> > >>>>> +#include <sys/time.h> > >>>>> +#include <malloc.h> > >>>>> > >>>>> #include "../kselftest.h" > >>>>> #include "cgroup_util.h" > >>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root) > >>>>> return test_zswap_writeback(root, false); > >>>>> } > >>>>> > >>>>> +static int zswapin_perf(const char *cgroup, void *arg) > >>>>> +{ > >>>>> + long pagesize = sysconf(_SC_PAGESIZE); > >>>>> + size_t memsize = MB(1*1024); > >>>>> + char buf[pagesize]; > >>>>> + int ret = -1; > >>>>> + char *mem; > >>>>> + struct timeval start, end; > >>>>> + > >>>>> + mem = (char *)memalign(2*1024*1024, memsize); > >>>>> + if (!mem) > >>>>> + return ret; > >>>>> + > >>>>> + /* > >>>>> + * Fill half of each page with increasing data, and keep other > >>>>> + * half empty, this will result in data that is still compressible > >>>>> + * and ends up in zswap, with material zswap usage. > >>>>> + */ > >>>>> + for (int i = 0; i < pagesize; i++) > >>>>> + buf[i] = i < pagesize/2 ? (char) i : 0; > >>>>> + > >>>>> + for (int i = 0; i < memsize; i += pagesize) > >>>>> + memcpy(&mem[i], buf, pagesize); > >>>>> + > >>>>> + /* Try and reclaim allocated memory */ > >>>>> + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { > >>>>> + ksft_print_msg("Failed to reclaim all of the requested memory\n"); > >>>>> + goto out; > >>>>> + } > >>>>> + > >>>>> + gettimeofday(&start, NULL); > >>>>> + /* zswpin */ > >>>>> + for (int i = 0; i < memsize; i += pagesize) { > >>>>> + if (memcmp(&mem[i], buf, pagesize)) { > >>>>> + ksft_print_msg("invalid memory\n"); > >>>>> + goto out; > >>>>> + } > >>>>> + } > >>>>> + gettimeofday(&end, NULL); > >>>>> + printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000); > >>>>> + ret = 0; > >>>>> +out: > >>>>> + free(mem); > >>>>> + return ret; > >>>>> +} > >>>>> + > >>>>> +static int test_zswapin_perf(const char *root) > >>>>> +{ > >>>>> + int ret = KSFT_FAIL; > >>>>> + char *test_group; > >>>>> + > >>>>> + test_group = cg_name(root, "zswapin_perf_test"); > >>>>> + if (!test_group) > >>>>> + goto out; > >>>>> + if (cg_create(test_group)) > >>>>> + goto out; > >>>>> + > >>>>> + if (cg_run(test_group, zswapin_perf, NULL)) > >>>>> + goto out; > >>>>> + > >>>>> + ret = KSFT_PASS; > >>>>> +out: > >>>>> + cg_destroy(test_group); > >>>>> + free(test_group); > >>>>> + return ret; > >>>>> +} > >>>>> + > >>>>> /* > >>>>> * When trying to store a memcg page in zswap, if the memcg hits its memory > >>>>> * limit in zswap, writeback should affect only the zswapped pages of that > >>>>> @@ -584,6 +654,7 @@ struct zswap_test { > >>>>> T(test_zswapin), > >>>>> T(test_zswap_writeback_enabled), > >>>>> T(test_zswap_writeback_disabled), > >>>>> + T(test_zswapin_perf), > >>>>> T(test_no_kmem_bypass), > >>>>> T(test_no_invasive_cgroup_shrink), > >>>>> }; > >>>>> > >>>>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@xxxxxxxxx/ > >>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@xxxxxxxx/ > >>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@xxxxxxxxxx/T/#u > >>>>> [4] https://lwn.net/Articles/955575/ > >>>>> > >>>>> Usama Arif (4): > >>>>> mm/zswap: skip swapcache for swapping in zswap pages > >>>>> mm/zswap: modify zswap_decompress to accept page instead of folio > >>>>> mm/zswap: add support for large folio zswapin > >>>>> mm/zswap: count successful large folio zswap loads > >>>>> > >>>>> Documentation/admin-guide/mm/transhuge.rst | 3 + > >>>>> include/linux/huge_mm.h | 1 + > >>>>> include/linux/zswap.h | 6 ++ > >>>>> mm/huge_memory.c | 3 + > >>>>> mm/memory.c | 16 +-- > >>>>> mm/page_io.c | 2 +- > >>>>> mm/zswap.c | 120 ++++++++++++++------- > >>>>> 7 files changed, 99 insertions(+), 52 deletions(-) > >>>>> > >>>>> -- > >>>>> 2.43.5 > >>>>> > >>>> > > Thanks Barry