> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Tuesday, September 24, 2024 12:39 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
> 21cnbao@xxxxxxxxx; akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
>
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >  	  reducing the chance that cold pages will reside in the zswap pool
> >  	  and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +	bool "Store mTHP and THP folios in zswap"
> > +	depends on ZSWAP
> > +	default n
> > +	help
> > +	  If selected, zswap will process mTHP and THP folios by
> > +	  compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >  	prompt "Default compressor"
> >  	depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >  		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +	CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >  	return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool: The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> > -					    struct obj_cgroup *objcg,
> > -					    struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +			     struct obj_cgroup *objcg,
> > +			     struct zswap_pool *pool)
>
> As I mentioned earlier, the patch that introduced zswap_store_page()
> should have directly used it in zswap_store(). This would make this
> patch much clearer.

Sure. I will fix this in v8.
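For context, since most of zswap_store_page() is elided in the hunk above:
each call handles one subpage of the folio and derives that page's swap
offset from the folio's base swap entry, roughly like this (a simplified
sketch of the idea, not the literal v7 code):

	/* Sketch: page 'index' of the folio maps to a consecutive swap offset. */
	swp_entry_t swp = folio->swap;
	pgoff_t offset = swp_offset(swp) + index;	/* this page's swap offset */
	struct page *page = folio_page(folio, index);	/* the page to compress */

Each subpage therefore becomes its own zswap entry at consecutive offsets,
which is what allows the failure path to invalidate the whole range
[offset, offset + nr_pages).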
>
> >  {
> >  	swp_entry_t swp = folio->swap;
> >  	int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> >  	return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> >  	swp_entry_t swp = folio->swap;
> >  	pgoff_t offset = swp_offset(swp);
> >  	struct xarray *tree = swap_zswap_tree(swp);
> > -	struct zswap_entry *entry;
> >  	struct obj_cgroup *objcg = NULL;
> >  	struct mem_cgroup *memcg = NULL;
> > +	struct zswap_pool *pool;
> > +	bool ret = false;
> > +	long index;
> >
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -	/* Large folios aren't supported */
> > -	if (folio_test_large(folio))
> > +	/* Storing large folios isn't enabled */
>
> The comment is now stating the obvious, please remove it.

Ok. I suppose this check will also no longer be needed, given that the
config knob is not needed.

>
> > +	if (!zswap_mthp_enabled && folio_test_large(folio))
> >  		return false;
> >
> >  	if (!zswap_enabled)
> > -		goto check_old;
> > +		goto reject;
> >
> > -	/* Check cgroup limits */
> > +	/*
> > +	 * Check cgroup limits:
> > +	 *
> > +	 * The cgroup zswap limit check is done once at the beginning of an
> > +	 * mTHP store, and not within zswap_store_page() for each page
> > +	 * in the mTHP. We do however check the zswap pool limits at the
> > +	 * start of zswap_store_page(). What this means is, the cgroup
> > +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +	 * However, the per-store-page zswap pool limits check should
> > +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> > +	 * reclaim implemented in the shrinker. If this assumption holds,
> > +	 * the cgroup exceeding the zswap limits could potentially be
> > +	 * resolved before the next zswap_store, and if it is not, the next
> > +	 * zswap_store would fail the cgroup zswap limit check at the start.
> > +	 */
>
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't
> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
>
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I see that Nhat has responded. Hopefully we can discuss this in the
follow-up to Nhat's comments.
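In the meantime, to make sure I understand the suggestion: would the
estimate-based check look roughly like the sketch below? (Purely
illustrative; the moving average and the helper below do not exist in
zswap today, and zswap_total_pages()/zswap_max_pages() here are just
stand-ins for the real global/memcg limit checks.)

	/*
	 * Hypothetical sketch: keep a running average of compressed bytes
	 * per stored page (updated after each successful zswap_compress()),
	 * and use it to estimate the whole folio's footprint so the memcg
	 * and global limits can each be checked once, up front.
	 */
	static atomic_long_t zswap_avg_compr_bytes = ATOMIC_LONG_INIT(0);

	static bool zswap_folio_would_exceed_limit(long nr_pages)
	{
		unsigned long est_bytes, est_pages;

		est_bytes = nr_pages * atomic_long_read(&zswap_avg_compr_bytes);
		est_pages = DIV_ROUND_UP(est_bytes, PAGE_SIZE);

		/* stand-in for the real limit check */
		return zswap_total_pages() + est_pages > zswap_max_pages();
	}

The same folio-sized estimate could then also be fed into the cgroup-side
check, rather than calling obj_cgroup_may_zswap() with no size information.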
Thanks,
Kanchana

>
> >  	objcg = get_obj_cgroup_from_folio(folio);
> >  	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> >  		memcg = get_mem_cgroup_from_objcg(objcg);
> >  		if (shrink_memcg(memcg)) {
> >  			mem_cgroup_put(memcg);
> > -			goto reject;
> > +			goto put_objcg;
> >  		}
> >  		mem_cgroup_put(memcg);
> >  	}
> >
> >  	if (zswap_check_limits())
> > -		goto reject;
> > -
> > -	/* allocate entry */
> > -	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > -	if (!entry) {
> > -		zswap_reject_kmemcache_fail++;
> > -		goto reject;
> > -	}
> > +		goto put_objcg;
> >
> > -	/* if entry is successfully added, it keeps the reference */
> > -	entry->pool = zswap_pool_current_get();
> > -	if (!entry->pool)
> > -		goto freepage;
> > +	pool = zswap_pool_current_get();
> > +	if (!pool)
> > +		goto put_objcg;
> >
> >  	if (objcg) {
> >  		memcg = get_mem_cgroup_from_objcg(objcg);
> > @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
> >  		mem_cgroup_put(memcg);
> >  	}
> >
> > -	if (!zswap_compress(&folio->page, entry))
> > -		goto put_pool;
> > -
> > -	entry->swpentry = swp;
> > -	entry->objcg = objcg;
> > -	entry->referenced = true;
> > -
> > -	if (!zswap_store_entry(tree, entry))
> > -		goto store_failed;
> > -
> > -	if (objcg) {
> > -		obj_cgroup_charge_zswap(objcg, entry->length);
> > -		count_objcg_event(objcg, ZSWPOUT);
> > -	}
> > -
> >  	/*
> > -	 * We finish initializing the entry while it's already in xarray.
> > -	 * This is safe because:
> > -	 *
> > -	 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -	 *
> > -	 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -	 *    The publishing order matters to prevent writeback from seeing
> > -	 *    an incoherent entry.
> > +	 * Store each page of the folio as a separate entry. If we fail to store
> > +	 * a page, unwind by removing all the previous pages we stored.
> >  	 */
> > -	if (entry->length) {
> > -		INIT_LIST_HEAD(&entry->lru);
> > -		zswap_lru_add(&zswap_list_lru, entry);
> > +	for (index = 0; index < nr_pages; ++index) {
> > +		if (!zswap_store_page(folio, index, objcg, pool))
> > +			goto put_pool;
> >  	}
> >
> > -	/* update stats */
> > -	atomic_inc(&zswap_stored_pages);
> > -	count_vm_event(ZSWPOUT);
> > -
> > -	return true;
> > +	ret = true;
> >
> > -store_failed:
> > -	zpool_free(entry->pool->zpool, entry->handle);
> >  put_pool:
> > -	zswap_pool_put(entry->pool);
> > -freepage:
> > -	zswap_entry_cache_free(entry);
> > -reject:
> > +	zswap_pool_put(pool);
> > +put_objcg:
> >  	obj_cgroup_put(objcg);
> >  	if (zswap_pool_reached_full)
> >  		queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +reject:
> >  	/*
> > -	 * If the zswap store fails or zswap is disabled, we must invalidate the
> > -	 * possibly stale entry which was previously stored at this offset.
> > -	 * Otherwise, writeback could overwrite the new data in the swapfile.
> > +	 * If the zswap store fails or zswap is disabled, we must invalidate
> > +	 * the possibly stale entries which were previously stored at the
> > +	 * offsets corresponding to each page of the folio. Otherwise,
> > +	 * writeback could overwrite the new data in the swapfile.
> >  	 */
> > -	zswap_delete_stored_offsets(tree, offset, nr_pages);
> > -	return false;
> > +	if (!ret)
> > +		zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +	return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
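P.S. For anyone reviewing the reject path above: it relies on
zswap_delete_stored_offsets() (added earlier in this series, not visible in
this hunk) to erase whatever was stored at each subpage's offset. Its shape
is roughly the following (a simplified sketch; the real helper in the series
may differ in details):

	/*
	 * Simplified sketch of the unwind/invalidate step: erase any entry
	 * at each subpage offset so writeback cannot later find stale data
	 * for this folio.
	 */
	static void delete_stored_offsets_sketch(struct xarray *tree,
						 pgoff_t offset, long nr_pages)
	{
		long i;

		for (i = 0; i < nr_pages; i++) {
			struct zswap_entry *entry = xa_erase(tree, offset + i);

			if (entry)
				zswap_entry_free(entry);
		}
	}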