> -----Original Message----- > From: Nhat Pham <nphamcs@xxxxxxxxx> > Sent: Tuesday, September 24, 2024 10:34 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx> > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; > hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx; > shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying > <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; > Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K > <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@xxxxxxxxx> wrote: > > > > zswap_store() will now store mTHP and PMD-size THP folios by compressing > > them page by page. > > > > This patch provides a sequential implementation of storing an mTHP in > > zswap_store() by iterating through each page in the folio to compress > > and store it in the zswap zpool. > > > > Towards this goal, zswap_compress() is modified to take a page instead > > of a folio as input. > > > > Each page's swap offset is stored as a separate zswap entry. > > > > If an error is encountered during the store of any page in the mTHP, > > all previous pages/entries stored will be invalidated. Thus, an mTHP > > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP. > > > > This forms the basis for building batching of pages during zswap store > > of large folios by compressing batches of up to say, 8 pages in an > > mTHP in parallel in hardware, with the Intel In-Memory Analytics > > Accelerator (Intel IAA). > > > > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by > default) > > will enable/disable zswap storing of (m)THP. The corresponding tunable > > zswap module parameter is "mthp_enabled". > > > > This change reuses and adapts the functionality in Ryan Roberts' RFC > > patch [1]: > > > > "[RFC,v1] mm: zswap: Store large folios without splitting" > > > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1- > ryan.roberts@xxxxxxx/T/#u > > > > Also, addressed some of the RFC comments from the discussion in [1]. > > > > Co-developed-by: Ryan Roberts > > Signed-off-by: > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> > > --- > > mm/Kconfig | 8 ++++ > > mm/zswap.c | 122 +++++++++++++++++++++++++---------------------------- > > 2 files changed, 66 insertions(+), 64 deletions(-) > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > index 09aebca1cae3..c659fb732ec4 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON > > reducing the chance that cold pages will reside in the zswap pool > > and consume memory indefinitely. > > > > +config ZSWAP_STORE_THP_DEFAULT_ON > > + bool "Store mTHP and THP folios in zswap" > > + depends on ZSWAP > > + default n > > + help > > + If selected, zswap will process mTHP and THP folios by > > + compressing and storing each 4K page in the large folio. > > + > > choice > > prompt "Default compressor" > > depends on ZSWAP > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 8f2e0ab34c84..16ab770546d6 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = > IS_ENABLED( > > CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); > > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, > 0644); > > > > +/* > > + * Enable/disable zswap processing of mTHP folios. > > + * For now, only zswap_store will process mTHP folios. > > + */ > > +static bool zswap_mthp_enabled = IS_ENABLED( > > + CONFIG_ZSWAP_STORE_THP_DEFAULT_ON); > > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, > 0644); > > + > > Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled? :) Agreed, zswap_thp_enabled is a better name. I will make the change in v8. More comments below as to the runtime knob. > > > bool zswap_is_enabled(void) > > { > > return zswap_enabled; > > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct > xarray *tree, > > * @objcg: The folio's objcg. > > * @pool: The zswap_pool to store the compressed data for the page. > > */ > > -static bool __maybe_unused zswap_store_page(struct folio *folio, long > index, > > - struct obj_cgroup *objcg, > > - struct zswap_pool *pool) > > +static bool zswap_store_page(struct folio *folio, long index, > > + struct obj_cgroup *objcg, > > + struct zswap_pool *pool) > > { > > swp_entry_t swp = folio->swap; > > int type = swp_type(swp); > > @@ -1551,51 +1559,63 @@ static bool __maybe_unused > zswap_store_page(struct folio *folio, long index, > > return false; > > } > > > > +/* > > + * Modified to store mTHP folios. Each page in the mTHP will be > compressed > > + * and stored sequentially. > > + */ > > bool zswap_store(struct folio *folio) > > { > > long nr_pages = folio_nr_pages(folio); > > swp_entry_t swp = folio->swap; > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > - struct zswap_entry *entry; > > struct obj_cgroup *objcg = NULL; > > struct mem_cgroup *memcg = NULL; > > + struct zswap_pool *pool; > > + bool ret = false; > > + long index; > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > > > - /* Large folios aren't supported */ > > - if (folio_test_large(folio)) > > + /* Storing large folios isn't enabled */ > > + if (!zswap_mthp_enabled && folio_test_large(folio)) > > return false; > > Hmm can this go wrong somehow? Can we have a case where we enable > zswap_mthp_enabled, have a large folio written to zswap, disable > zswap_mthp_enabled, and attempt to store that folio to zswap again. > > Now, we have a stale copy in zswap that is not invalidated...? > > Or am I missing something here :) This is an excellent point. Thanks Nhat for catching this! I can see two options to solving this: Option 1: If zswap_mthp_enabled is "false", delete all stored offsets for the mTHP in zswap before exiting. This could race with writeback (either one or more subpages could be written back before zswap_store acquires the tree lock), however, I don't think it will cause data inconsistencies. Any offsets for subpages not written back will be deleted from zswap, zswap_store() will return false, and the backing swap device's subsequent swapout will over-write the zswap write-back data. Could anything go wrong with this? Option 2: Only provide a build config option, CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed. Would appreciate suggestions on these, and other potential solutions. Thanks, Kanchana