On 08/11/2024 06:51, Barry Song wrote:
> On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Hi, Barry,
>>
>> Barry Song <21cnbao@xxxxxxxxx> writes:
>>
>>> From: Barry Song <v-songbaohua@xxxxxxxx>
>>>
>>> When large folios are compressed at a larger granularity, we observe
>>> a notable reduction in CPU usage and a significant improvement in
>>> compression ratios.
>>>
>>> mTHP's ability to be swapped out without splitting and swapped back in
>>> as a whole allows compression and decompression at larger granularities.
>>>
>>> This patchset enhances zsmalloc and zram by adding support for dividing
>>> large folios into multi-page blocks, typically configured with a
>>> 2-order granularity. Without this patchset, a large folio is always
>>> divided into `nr_pages` 4KiB blocks.
>>>
>>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
>>> setting, where the default of 2 allows all anonymous THP to benefit.
>>>
>>> Examples include:
>>> * A 16KiB large folio will be compressed and stored as a single 16KiB
>>>   block.
>>> * A 64KiB large folio will be compressed and stored as four 16KiB
>>>   blocks.
>>>
>>> For example, swapping out and swapping in 100MiB of typical anonymous
>>> data 100 times (with 16KiB mTHP enabled) using zstd yields the following
>>> results:
>>>
>>>                        w/o patches    w/ patches
>>> swap-out time(ms)      68711          49908
>>> swap-in time(ms)       30687          20685
>>> compression ratio      20.49%         16.9%
>>
>> The data looks good.  Thanks!
>>
>> Have you considered the situation that the large folio fails to be
>> allocated during swap-in?  It's possible because the memory may be very
>> fragmented.
>
> That's correct, good question. On phones, we use a large folio pool to
> maintain a relatively high allocation success rate. When mTHP allocation
> fails, we have a workaround to allocate nr_pages of small folios and map
> them together to avoid partial reads. This ensures that the benefits of
> larger block compression and decompression are consistently maintained.
> That was the code running on production phones.
>

Thanks for sending the v2!

How is the large folio pool maintained? I don't think there is anything in
the upstream kernel for this. The only thing I saw on the mailing list is
TAO, which is for PMD-mappable THPs only; I think that was about 7-8 months
ago and wasn't merged?

The workaround of allocating nr_pages small folios and mapping them together
to avoid partial reads is also not upstream, right?

Do you have any data on how this would perform with the upstream kernel,
i.e. without the large folio pool and the workaround, and whether
large-granularity compression is worth having without those patches?

Thanks,
Usama

> We also previously experimented with maintaining multiple buffers for
> decompressed large blocks in zRAM, allowing upcoming do_swap_page() calls
> to use them when falling back to small folios. In this setup, the buffers
> achieved a high hit rate, though I don't recall the exact number.
>
> I'm concerned that this fault-around-like fallback to nr_pages small
> folios may not gain traction upstream. Do you have any suggestions for
> improvement?
>
>>
>>> -v2:
>>>  While it is not mature yet, I know some people are waiting for
>>>  an update :-)
>>>  * Fixed some stability issues.
>>>  * Rebased against the latest mm-unstable.
>>>  * Set default order to 2, which benefits all anon mTHP.
>>>  * Multi-page ZsPageMovable is not supported yet.
>>>
>>> Tangquan Zheng (2):
>>>   mm: zsmalloc: support objects compressed based on multiple pages
>>>   zram: support compression at the granularity of multi-pages
>>>
>>>  drivers/block/zram/Kconfig    |   9 +
>>>  drivers/block/zram/zcomp.c    |  17 +-
>>>  drivers/block/zram/zcomp.h    |  12 +-
>>>  drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++---
>>>  drivers/block/zram/zram_drv.h |  45 ++++
>>>  include/linux/zsmalloc.h      |  10 +-
>>>  mm/Kconfig                    |  18 ++
>>>  mm/zsmalloc.c                 | 232 +++++++++++++-----
>>>  8 files changed, 699 insertions(+), 94 deletions(-)
>>
>> --
>> Best Regards,
>> Huang, Ying
>
> Thanks
> barry
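
For reference, here is a minimal, self-contained C sketch of the block-split
arithmetic described in the cover letter, assuming the default
ZSMALLOC_MULTI_PAGES_ORDER of 2 (4-page, i.e. 16KiB, blocks with 4KiB pages).
The helper name folio_compress_blocks() is hypothetical and not the patch's
actual API, and the behaviour for folios smaller than one block (keeping
page-sized blocks) is an assumption, not something stated in the thread.

/*
 * Illustrative sketch only, not the patchset's code: with
 * ZSMALLOC_MULTI_PAGES_ORDER = 2 each compression block covers
 * 1 << 2 = 4 pages (16KiB), so a 16KiB folio compresses as one
 * block and a 64KiB folio as four blocks, matching the examples
 * in the cover letter.
 */
#include <stdio.h>

#define ZSMALLOC_MULTI_PAGES_ORDER	2	/* default from the cover letter */
#define MULTI_PAGES_NR			(1 << ZSMALLOC_MULTI_PAGES_ORDER)

/* Number of compression blocks a folio of nr_pages pages is split into. */
static unsigned int folio_compress_blocks(unsigned int nr_pages)
{
	/* Assumption: folios smaller than one block keep page-sized blocks. */
	if (nr_pages < MULTI_PAGES_NR)
		return nr_pages;

	return nr_pages / MULTI_PAGES_NR;
}

int main(void)
{
	/* 16KiB folio = 4 pages -> 1 block; 64KiB folio = 16 pages -> 4 blocks */
	printf("16KiB folio: %u block(s)\n", folio_compress_blocks(4));
	printf("64KiB folio: %u block(s)\n", folio_compress_blocks(16));
	return 0;
}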