On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> To prevent the zswap global shrinker from writing back pages
> simultaneously with IO performed for memory reclaim and faults, delay
> the writeback when zswap_store() rejects pages or zswap_load() cannot
> find the entry in the pool.
>
> When the zswap shrinker is running and zswap rejects an incoming page,
> simultaneous zswap writeback and the rejected page lead to IO
> contention on the swap device. In this case, the writeback of the
> rejected page must take higher priority as it is necessary for actual
> memory reclaim progress. The zswap global shrinker can run in the
> background and should not interfere with memory reclaim.

Do you see this problem actually manifesting in real life? It does not
sound infeasible to me, but I wonder how likely it is in practice. Do
you have any userspace-visible metrics, or any tracing logs etc., that
prove it actually happens?

This might affect the dynamic shrinker as well, FWIW.

>
> The same logic applies to zswap_load(). When zswap cannot find the
> requested page in the pool and read IO is performed, the shrinker
> should be interrupted.
>
> To avoid IO contention, save the timestamp jiffies when zswap cannot
> buffer the pagein/out IO and interrupt the global shrinker. The
> shrinker resumes the writeback 500 msec after the saved timestamp.
>
> Signed-off-by: Takero Funaki <flintglass@xxxxxxxxx>
> ---
>  mm/zswap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index def0f948a4ab..59ba4663c74f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,8 @@
>  #include <linux/pagemap.h>
>  #include <linux/workqueue.h>
>  #include <linux/list_lru.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
>
>  #include "swap.h"
>  #include "internal.h"
> @@ -176,6 +178,14 @@ static bool zswap_next_shrink_changed;
>  static struct work_struct zswap_shrink_work;
>  static struct shrinker *zswap_shrinker;
>
> +/*
> + * To avoid IO contention between pagein/out and global shrinker writeback,
> + * track the last jiffies of pagein/out and delay the writeback.
> + * Default to 500msec in alignment with mq-deadline read timeout.

If there is a future version, could you include the reason why you
selected 500 msec in the patch's changelog as well?

> + */
> +#define ZSWAP_GLOBAL_SHRINKER_DELAY_MS 500
> +static unsigned long zswap_shrinker_delay_start;
> +
>  /*
>   * struct zswap_entry
>   *
> @@ -244,6 +254,14 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,	\
>  		 zpool_get_type((p)->zpools[0]))
>
> +static inline void zswap_shrinker_delay_update(void)
> +{
> +	unsigned long now = jiffies;
> +
> +	if (now != zswap_shrinker_delay_start)
> +		zswap_shrinker_delay_start = now;
> +}

Hmmm, is there a reason why we do not just do:

	zswap_shrinker_delay_start = jiffies;

or, more unnecessarily:

	unsigned long now = jiffies;
	zswap_shrinker_delay_start = now;

IOW, why the branching here? It does not seem necessary to me, but
perhaps there is a correctness/compiler reason I'm not seeing?

In fact, if it's the first version, then we could manually inline it.

Additionally/alternatively, I wonder if it is more convenient to do it
at the mm/page_io.c zswap callsites, i.e. whenever zswap_store() and
zswap_load() return false, delay the shrinker before proceeding with
the IO steps.
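Roughly something like this, i.e. a completely untested sketch against
swap_writepage() — it assumes the delay helper is made visible outside
mm/zswap.c (which this patch does not do), and I'm hand-waving the
exact surrounding code:

	/* mm/page_io.c, swap_writepage() */
	if (zswap_store(folio)) {
		folio_unlock(folio);
		return 0;
	}
	/*
	 * zswap rejected the folio, so the swap-out IO below is on the
	 * critical path of reclaim. Push the global shrinker's backoff
	 * window out before issuing it.
	 */
	zswap_shrinker_delay_update();
	__swap_writepage(folio, wbc);

and symmetrically in swap_read_folio() when zswap_load() misses. That
would keep the delay bookkeeping next to the IO it is protecting,
rather than scattered across the zswap failure paths.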
> +
>  /*********************************
>  * pool functions
>  **********************************/
> @@ -1378,6 +1396,8 @@ static void shrink_worker(struct work_struct *w)
>  	struct mem_cgroup *memcg;
>  	int ret, failures = 0, progress;
>  	unsigned long thr;
> +	unsigned long now, sleepuntil;
> +	const unsigned long delay = msecs_to_jiffies(ZSWAP_GLOBAL_SHRINKER_DELAY_MS);
>
>  	/* Reclaim down to the accept threshold */
>  	thr = zswap_accept_thr_pages();
> @@ -1405,6 +1425,21 @@ static void shrink_worker(struct work_struct *w)
>  	 * until the next run of shrink_worker().
>  	 */
>  	do {
> +		/*
> +		 * delay shrinking to allow the last rejected page completes
> +		 * its writeback
> +		 */
> +		sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);

I assume we do not care about racy access here, right? Same goes for
updates - I don't see any locks protecting these operations (but I
could be missing something). More on this at the end of the mail.

> +		now = jiffies;
> +		/*
> +		 * If zswap did not reject pages for long, sleepuntil-now may
> +		 * underflow. We assume the timestamp is valid only if
> +		 * now < sleepuntil < now + delay + 1
> +		 */
> +		if (time_before(now, sleepuntil) &&
> +		    time_before(sleepuntil, now + delay + 1))
> +			fsleep(jiffies_to_usecs(sleepuntil - now));
> +
>  		spin_lock(&zswap_shrink_lock);
>
>  		/*
> @@ -1526,8 +1561,10 @@ bool zswap_store(struct folio *folio)
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
>  	/* Large folios aren't supported */
> -	if (folio_test_large(folio))
> +	if (folio_test_large(folio)) {
> +		zswap_shrinker_delay_update();
>  		return false;
> +	}
>
>  	if (!zswap_enabled)
>  		goto check_old;
> @@ -1648,6 +1685,8 @@ bool zswap_store(struct folio *folio)
>  	zswap_entry_cache_free(entry);
>  reject:
>  	obj_cgroup_put(objcg);
> +	zswap_shrinker_delay_update();
> +
>  	if (need_global_shrink)
>  		queue_work(shrink_wq, &zswap_shrink_work);
>  check_old:
> @@ -1691,8 +1730,10 @@ bool zswap_load(struct folio *folio)
>  	else
>  		entry = xa_load(tree, offset);
>
> -	if (!entry)
> +	if (!entry) {
> +		zswap_shrinker_delay_update();
>  		return false;
> +	}
>
>  	if (entry->length)
>  		zswap_decompress(entry, page);
> @@ -1835,6 +1876,8 @@ static int zswap_setup(void)
>  	if (ret)
>  		goto hp_fail;
>
> +	zswap_shrinker_delay_update();
> +
>  	shrink_wq = alloc_workqueue("zswap-shrink",
>  			WQ_UNBOUND, 1);
>  	if (!shrink_wq)
> --
> 2.43.0
>
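One more thought on the racy access question above: if the answer is
that staleness is tolerable, it may still be worth making that
explicit by pairing the READ_ONCE() in shrink_worker() with a
WRITE_ONCE() on the update side. An untested sketch of what I mean:

	static inline void zswap_shrinker_delay_update(void)
	{
		/*
		 * Concurrent updates can race, but that is harmless: a
		 * stale timestamp only shortens or skips the backoff.
		 * WRITE_ONCE() pairs with the READ_ONCE() in
		 * shrink_worker() to avoid torn accesses, document the
		 * lockless design, and keep KCSAN quiet.
		 */
		WRITE_ONCE(zswap_shrinker_delay_start, jiffies);
	}

This would also sidestep the branching question above.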