Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io devices

On Sat, Aug 24, 2024 at 5:56 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> Hi Barry,
>
> On Thu, Aug 22, 2024 at 05:13:06AM GMT, Barry Song wrote:
> > On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > >
> > > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@xxxxxxxx wrote:
> > > > From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> > > >
> > > >
> > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
> > > >    zsmalloc compression/decompression with larger granularity[2]. The upcoming
> > > >    optimization in zsmalloc will significantly increase swap speed and improve
> > > >    compression efficiency. Tested by running 100 iterations of swapping 100MiB
> > > >    of anon memory, the swap speed improved dramatically:
> > > >                 time consumption of swapin(ms)   time consumption of swapout(ms)
> > > >      lz4 4k                  45274                    90540
> > > >      lz4 64k                 22942                    55667
> > > >      zstdn 4k                85035                    186585
> > > >      zstdn 64k               46558                    118533
> > >
> > > Are the above number with the patch series at [2] or without? Also can
> > > you explain your experiment setup or how can someone reproduce these?
> >
> > Hi Shakeel,
> >
> > The data was recorded after applying both this patch (swap-in mTHP) and
> > patch [2] (compressing/decompressing mTHP instead of page). However,
> > without the swap-in series, patch [2] becomes useless because:
> >
> > If we have a large object, such as 16 pages in zsmalloc:
> > do_swap_page will happen 16 times:
> > 1. decompress the whole large object and copy one page;
> > 2. decompress the whole large object and copy one page;
> > 3. decompress the whole large object and copy one page;
> > ....
> > 16.  decompress the whole large object and copy one page;
> >
> > So, patchset [2] will actually degrade performance rather than
> > enhance it if we don't have this swap-in series. This swap-in
> > series is a prerequisite for the zsmalloc/zram series.
>
> Thanks for the explanation.
>
> >
> > We reproduced the data through the following simple steps:
> > 1. Collected anonymous pages from a running phone and saved them to a file.
> > 2. Used a small program to open and read the file into mapped anonymous
> > memory.
> > 3. Do the following in the small program:
> > swapout_start_time
> > madv_pageout()
> > swapout_end_time
> >
> > swapin_start_time
> > read_data()
> > swapin_end_time
> >
> > We calculate the throughput of swapout and swapin using the difference between
> > end_time and start_time. Additionally, we record the memory usage of zram after
> > the swapout is complete.
> >
>
> Please correct me if I am wrong but you are saying in your experiment,
> 100 MiB took 90540 ms to compress/swapout and 45274 ms to
> decompress/swapin if backed by 4k pages but took 55667 ms and 22942 ms
> if backed by 64k pages. Basically the table shows total time to compress
> or decompress 100 MiB of memory, right?

Hi Shakeel,
Tangquan (CC'd) collected the data and double-checked the case to confirm
the answer to your question.

We have three cases:
1. no mTHP swap-in, no zsmalloc/zram multi-pages compression/decompression
2. have mTHP swap-in, no zsmalloc/zram multi-pages compression/decompression
3. have mTHP swap-in, have zsmalloc/zram multi-pages compression/decompression

The data was 1 vs 3.

To provide more precise data covering each change, Tangquan tested 1 vs. 2
and 2 vs. 3 yesterday using LZ4 per my request (the hardware might differ
from the previous test, but the data shows the same trend).

1. no mTHP swap-in, no zsmalloc/zram patch
   swapin_ms:  30336
   swapout_ms: 65651

2. have mTHP swap-in, no zsmalloc/zram patch
   swapin_ms:  27161
   swapout_ms: 61135

3. have mTHP swap-in, have zsmalloc/zram patch
   swapin_ms:  13683
   swapout_ms: 43305

The test pseudocode is as follows:

addr = mmap(100M);
read_anon_data_from_file_to_addr();

for (i = 0; i < 100; i++) {
	swapout_start_time;
	madv_pageout();
	swapout_end_time;

	swapin_start_time;
	read_addr_to_swapin();
	swapin_end_time;
}

So, while we saw some improvement from 1 to 2, the significant gains
come from using large blocks for compression and decompression.

This mTHP swap-in series ensures that mTHPs aren't lost after the first
swap-in, so the following 99 iterations continue to involve THP swap-out
and mTHP swap-in.
The improvement from 1 to 2 is due to this mTHP swap-in series, while the
improvement from 2 to 3 comes from the zsmalloc/zram patchset [2] you
mentioned.

[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@xxxxxxxxx/


Thanks
Barry




