Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.

Usama Arif <usamaarif642@xxxxxxxxx> · Sun, 20 Oct 2024 17:50:25 +0100

On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> Hi Usama, Nhat,
> 
>> -----Original Message-----
>> From: Nhat Pham <nphamcs@xxxxxxxxx>
>> Sent: Friday, October 18, 2024 10:21 AM
>> To: Usama Arif <usamaarif642@xxxxxxxxx>
>> Cc: David Hildenbrand <david@xxxxxxxxxx>; Sridhar, Kanchana P
>> <kanchana.p.sridhar@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-
>> mm@xxxxxxxxx; hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx;
>> chengming.zhou@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
>> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
>> hughd@xxxxxxxxxx; willy@xxxxxxxxxxxxx; bfoster@xxxxxxxxxx;
>> dchinner@xxxxxxxxxx; chrisl@xxxxxxxxxx; Feghali, Wajdi K
>> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
>> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
>> swapin_readahead() zswap load batching interface.
>>
>> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@xxxxxxxxx>
>> wrote:
>>>
>>>
>>> On 18/10/2024 08:26, David Hildenbrand wrote:
>>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>>>>> This patch invokes the swapin_readahead() based batching interface to
>>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
>>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios
>> based
>>>>> on vm.page-cluster and the usefulness of prior prefetches to the
>>>>> workload. As folios are created in the swapcache and the readahead
>> code
>>>>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
>> the
>>>>> respective folio_batches get populated with the folios to be read.
>>>>>
>>>>> Finally, the swapin_readahead() procedures will call the newly added
>>>>> process_ra_batch_of_same_type() which:
>>>>>
>>>>>   1) Reads all the non_zswap_batch folios sequentially by calling
>>>>>      swap_read_folio().
>>>>>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
>> calls
>>>>>      zswap_finish_load_batch() that finally decompresses each
>>>>>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
>> prefetch
>>>>>      batch of say, 32 folios) in parallel with IAA.
>>>>>
>>>>> Within do_swap_page(), we try to benefit from batch decompressions in
>> both
>>>>> these scenarios:
>>>>>
>>>>>   1) single-mapped, SWP_SYNCHRONOUS_IO:
>>>>>        We call swapin_readahead() with "single_mapped_path = true". This
>> is
>>>>>        done only in the !zswap_never_enabled() case.
>>>>>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>>>>>        We call swapin_readahead() with "single_mapped_path = false".
>>>>>
>>>>> This will place folios in the swapcache: a design choice that handles
>> cases
>>>>> where a folio that is "single-mapped" in process 1 could be prefetched in
>>>>> process 2; and handles highly contended server scenarios with stability.
>>>>> There are checks added at the end of do_swap_page(), after the folio has
>>>>> been successfully loaded, to detect if the single-mapped swapcache folio
>> is
>>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
>>>>>
>>>>> Within the swapin_readahead() functions, if single_mapped_path is true,
>> and
>>>>> either the platform does not have IAA, or, if the platform has IAA and the
>>>>> user selects a software compressor for zswap (details of sysfs knob
>>>>> follow), readahead/batching are skipped and the folio is loaded using
>>>>> zswap_load().
>>>>>
>>>>> A new swap parameter "singlemapped_ra_enabled" (false by default) is
>> added
>>>>> for platforms that have IAA, zswap_load_batching_enabled() is true, and
>> we
>>>>> want to give the user the option to run experiments with IAA and with
>>>>> software compressors for zswap (swap device is
>> SWP_SYNCHRONOUS_IO):
>>>>>
>>>>> For IAA:
>>>>>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> For software compressors:
>>>>>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
>> skip
>>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
>> do_swap_page()
>>>>> path.
>>>>>
>>>>> Thanks Ying Huang for the really helpful brainstorming discussions on the
>>>>> swap_read_folio() plug design.
>>>>>
>>>>> Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
>>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
>>>>> ---
>>>>>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++--
>> ---------
>>>>>   mm/shmem.c      |   2 +-
>>>>>   mm/swap.h       |  12 ++--
>>>>>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
>> --
>>>>>   mm/swapfile.c   |   2 +-
>>>>>   5 files changed, 299 insertions(+), 61 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index b5745b9ffdf7..9655b85fc243 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t
>> remove_device_exclusive_entry(struct vm_fault *vmf)
>>>>>       return 0;
>>>>>   }
>>>>>   +/*
>>>>> + * swapin readahead based batching interface for zswap batched loads
>> using IAA:
>>>>> + *
>>>>> + * Should only be called for and if the faulting swap entry in
>> do_swap_page
>>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>>>>> + *
>>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
>>>>> + * process, and further, there are no additional references to this folio
>>>>> + * (for e.g. if another process simultaneously readahead this swap entry
>>>>> + * while this process was handling the page-fault, and got a pointer to
>> the
>>>>> + * folio allocated by this process in the swapcache), besides the
>> references
>>>>> + * that were obtained within __read_swap_cache_async() by this
>> process that is
>>>>> + * faulting in this single-mapped swap entry.
>>>>> + */
>>>>
>>>> How is this supposed to work for large folios?
>>>>
>>>
>>> Hi,
>>>
>>> I was looking at zswapin large folio support and have posted a RFC in [1].
>>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>>>
>>> It looks quite different, and I think simpler from this series, so might be
>>> a good comparison.
>>>
>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-
>> usamaarif642@xxxxxxxxx/
>>>
>>> Thanks,
>>> Usama
>>
>> I agree.
>>
>> I think the lower hanging fruit here is to build upon Usama's patch.
>> Kanchana, do you think we can just use the new batch decompressing
>> infrastructure, and apply it to Usama's large folio zswap loading?
>>
>> I'm not denying the readahead idea outright, but that seems much more
>> complicated. There are questions regarding the benefits of
>> readahead-ing when apply to zswap in the first place - IIUC, zram
>> circumvents that logic in several cases, and zswap shares many
>> characteristics with zram (fast, synchronous compression devices).
>>
>> So let's reap the low hanging fruits first, get the wins as well as
>> stress test the new infrastructure. Then we can discuss the readahead
>> idea later?
> 
> Thanks Usama for publishing the zswap large folios swapin series, and
> thanks Nhat for your suggestions.  Sure, I can look into integrating the
> new batch decompressing infrastructure with Usama's large folio zswap
> loading.
> 
> However, I think we need to get clarity on a bigger question: does it
> make sense to swapin large folios? Some important considerations
> would be:
> 
> 1) What are the tradeoffs in memory footprint cost of swapping in a
>     large folio?

I would say the pros and cons of swapping in a large folio are the same as
the pros and cons of large folios in general.
As mentioned in my cover letter and the series that introduced large folios
you get fewer page faults, batched PTE and rmap manipulation, reduced lru list,
TLB coalescing (for arm64 and AMD) at the cost of higher memory usage and
fragmentation.

The other thing is that the series I wrote is hopefully just a start.
As shown by Barry in the case of zram in 
https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@xxxxxxxxx/
there is a significant improvement in CPU utilization and compression
ratios when compressing at large granularity. Hopefully we can
try and do something similar for zswap. Not sure how that would look
for zswap as I haven't started looking at that yet.

> 2) If we decide to let the user determine this by say, an option that
>      determines the swapin granularity (e.g. no more than 32k at a time),
>      how does this constrain compression and zpool storage granularity?
> 
Right now whether or not zswapin happens is determined using
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled
I assume when the someone sets some of these to always, they know that
their workload works best with those page sizes, so they would want folios
to be swapped in and used at that size as well?

There might be some merit in adding something like
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
as you might start thrashing swap if you are for e.g. swapping
in 1M folios and there isn't enough memory for it, which causes
you to swapout another folio in its place.

> Ultimately, I feel the bigger question is about memory utilization cost
> of large folio swapin. The swapin_readahead() based approach tries to
> use the prefetch-usefulness characteristics of the workload to improve
> the efficiency of multiple 4k folios by using strategies like parallel
> decompression, to strike some balance in memory utilization vs.
> efficiency.
> 
> Usama, I downloaded your patch-series and tried to understand this
> better, and wanted to share the data.
> 
> I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> and 16k/32k/64k large folios enabled to "always":
> 
> 16k/32k/64k folios: kernel compilation with zstd:
>  =================================================
> 
>  ------------------------------------------------------------------------------
>                         mm-unstable-10-16-2024    + zswap large folios swapin
>                                                                        series
>  ------------------------------------------------------------------------------
>  zswap compressor                         zstd                           zstd
>  vm.page-cluster                             0                              0
>  ------------------------------------------------------------------------------
>  real_sec                               772.53                         870.61
>  user_sec                            15,780.29                      15,836.71
>  sys_sec                              5,353.20                       6,185.02
>  Max_Res_Set_Size_KB                 1,873,348                      1,873,004
>                                                                              
>  ------------------------------------------------------------------------------
>  memcg_high                                  0                              0
>  memcg_swap_fail                             0                              0
>  zswpout                            93,811,916                    111,663,872
>  zswpin                             27,150,029                     54,730,678
>  pswpout                                    64                             59
>  pswpin                                     78                             53
>  thp_swpout                                  0                              0
>  thp_swpout_fallback                         0                              0
>  16kB-mthp_swpout_fallback                   0                              0
>  32kB-mthp_swpout_fallback                   0                              0
>  64kB-mthp_swpout_fallback               5,470                              0
>  pgmajfault                         29,019,256                     16,615,820
>  swap_ra                                     0                              0
>  swap_ra_hit                             3,004                          3,614
>  ZSWPOUT-16kB                        1,324,160                      2,252,747
>  ZSWPOUT-32kB                          730,534                      1,356,640
>  ZSWPOUT-64kB                        3,039,760                      3,955,034
>  ZSWPIN-16kB                                                        1,496,916
>  ZSWPIN-32kB                                                        1,131,176
>  ZSWPIN-64kB                                                        1,866,884
>  SWPOUT-16kB                                 0                              0
>  SWPOUT-32kB                                 0                              0
>  SWPOUT-64kB                                 4                              3
>  ------------------------------------------------------------------------------
> 
> It does appear like there is considerably higher swapout and swapin
> activity as a result of swapping in large folios, which does end up
> impacting performance.

Thanks for having a look!
I had only tested with the microbenchmark for time taken to zswapin that I included in
my coverletter.
In general I expected zswap activity to go up as you are more likely to experience
memory pressure when swapping in large folios, but in return get lower pagefaults
and the advantage of lower TLB pressure in AMD and arm64.

Thanks for the test, those number look quite extreme! I think there is a lot of swap
thrashing. 
I am assuming you are testing on an intel machine, where you don't get the advantage
of lower TLB misses of large folios, I will try and get an AMD machine which has
TLB coalescing or an ARM server with CONT_PTE to see if the numbers get better.

Maybe it might be better for large folio zswapin to be considered along with
larger granuality compression to get all the benefits of large folios (and
hopefully better numbers.) I think that was the approach taken for zram as well.

> 
> I would appreciate thoughts on understanding the usefulness of
> swapping in large folios, with the considerations outlined earlier/other
> factors.
> 
> Thanks,
> Kanchana