RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@xxxxxxxxx>
> Sent: Sunday, August 18, 2024 8:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes:
> 
> [snip]
> 
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> > ZSWAP. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > Since I was constrained to get the 70 usemem processes to generate
> > swapout activity with the 4G SSD, I ended up using different cgroup
> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> >
> > 64K mTHP experiments: cgroup memory fixed at 60G
> > 2M THP experiments  : cgroup memory fixed at 55G
> >
> > The vm/sysfs stats included after the performance data provide details
> > on the swapout activity to SSD/ZSWAP.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >     ZSWAP Allocator   : ZSMALLOC
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows, averaged across 3 runs:
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> 
> zswap throughput is worse than ssd swap?  This doesn't look right.

I realize it might look that way, but this is not an apples-to-apples comparison,
as explained in the latter part of my analysis (after the 2M THP data tables).
The primary reason is that the test runs under a fixed cgroup memory limit.

In the "Before" scenario, mTHP get swapped out to SSD. However, the disk swap
usage is not accounted towards checking if the cgroup's memory limit has been
exceeded. Hence there are relatively fewer swap-outs, resulting mainly from the
1G allocations from each of the 70 usemem processes working with a 60G memory
limit on the parent cgroup.

The picture changes in the "After" scenario. mTHPs are now stored in zswap,
and the compressed data in the zswap zpool is charged to the cgroup's
memory.current, counting towards the fixed memory limit of the parent cgroup
in addition to the 1G allocations from each of the 70 processes.
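
For reference, this charging can be seen directly in the cgroup v2 interface.
A minimal sketch, assuming a cgroup v2 mount at /sys/fs/cgroup and a test
cgroup named "usemem" (the cgroup name here is only illustrative):

    # memory.current includes the zswap zpool memory charged to the cgroup
    cat /sys/fs/cgroup/usemem/memory.current

    # compressed bytes held in zswap, and the uncompressed bytes they back
    grep -E '^(zswap|zswapped) ' /sys/fs/cgroup/usemem/memory.stat

    # the same compressed usage, via the dedicated zswap counter
    cat /sys/fs/cgroup/usemem/memory.zswap.current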

As you can see, this creates more memory pressure on the cgroup, resulting in
more swap-outs. With lz4 as the zswap compressor, this results in lower
throughput than "Before".

However, with IAA as the zswap compressor, the throughput with zswap mTHP is
better than "Before" because of the lower hardware compression latency, which
absorbs the higher swap-out activity without compromising throughput.
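
For completeness, a minimal sketch of how the two compressor configurations can
be switched between runs, assuming the standard zswap module parameters and the
vm.page-cluster sysctl (values mirror the configuration listed above):

    # select the zswap compressor for the run (lz4 or deflate-iaa)
    echo deflate-iaa > /sys/module/zswap/parameters/compressor

    # zsmalloc allocator and swap page-cluster, as listed above
    echo zsmalloc > /sys/module/zswap/parameters/zpool
    echo 2 > /proc/sys/vm/page-cluster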

> 
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> >   ------------------------------------------------------------------
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpin                       |          0 |           0 |           0 |
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpin                       |        703 |         534 |         721 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> 
> It appears that the number of swapped pages for zswap is much larger
> than that of SSD swap.  Why?  I guess this is why zswap throughput is
> worse.

Your observation is correct. I hope the explanation above clarifies the
reasoning behind this.
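
In case it is useful, a minimal sketch of how the counters in the table above
can be gathered, assuming the 64K mTHP size used in these runs (the per-size
zswpout file is the one reported with this series applied):

    # global swap-out activity: SSD (pswpout) vs ZSWAP (zswpout)
    grep -E '^(pswp|zswp)(in|out) ' /proc/vmstat
    grep -E '^(thp_swpout|pgmajfault)' /proc/vmstat

    # per-size mTHP swap-out counters for 64K folios
    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout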

Thanks,
Kanchana

> 
> >  |-----------------------------------------------------------------------|
> >  | thp_swpout                   |          0 |           0 |           0 |
> >  | thp_swpout_fallback          |          0 |           0 |           0 |
> >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying




