Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO

On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> To prevent the zswap global shrinker from writing back pages
> simultaneously with IO performed for memory reclaim and faults, delay
> the writeback when zswap_store() rejects pages or zswap_load() cannot
> find entry in pool.
>
> When the zswap shrinker is running and zswap rejects an incoming page,
> simultaneous zswap writeback and swap-out of the rejected page lead to
> IO contention on the swap device. In this case, the writeback of the
> rejected page must take higher priority, as it is necessary for actual
> memory reclaim progress.
> The zswap global shrinker can run in the background and should not
> interfere with memory reclaim.

Do you see this problem actually manifest in real life? It does not
sound implausible to me, but I wonder how likely it is in practice.

Do you have any userspace-visible metrics, or any tracing logs etc.
that show it actually happening?

FWIW, this might also affect the dynamic shrinker.

>
> The same logic applies to zswap_load(). When zswap cannot find the
> requested page in the pool and read IO is performed, the shrinker
> should be interrupted.
>
> To avoid IO contention, save the timestamp in jiffies when zswap cannot
> buffer the pagein/out IO, and interrupt the global shrinker. The
> shrinker resumes writeback 500 msec after the saved timestamp.
>
> Signed-off-by: Takero Funaki <flintglass@xxxxxxxxx>
> ---
>  mm/zswap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index def0f948a4ab..59ba4663c74f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,8 @@
>  #include <linux/pagemap.h>
>  #include <linux/workqueue.h>
>  #include <linux/list_lru.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
>
>  #include "swap.h"
>  #include "internal.h"
> @@ -176,6 +178,14 @@ static bool zswap_next_shrink_changed;
>  static struct work_struct zswap_shrink_work;
>  static struct shrinker *zswap_shrinker;
>
> +/*
> + * To avoid IO contention between pagein/out and global shrinker writeback,
> + * track the last jiffies of pagein/out and delay the writeback.
> + * Default to 500msec in alignment with mq-deadline read timeout.

If there is a future version, could you include the reason why you
selected 500 msec in the patch's changelog as well?

> + */
> +#define ZSWAP_GLOBAL_SHRINKER_DELAY_MS 500
> +static unsigned long zswap_shrinker_delay_start;
> +
>  /*
>   * struct zswap_entry
>   *
> @@ -244,6 +254,14 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
>                  zpool_get_type((p)->zpools[0]))
>
> +static inline void zswap_shrinker_delay_update(void)
> +{
> +       unsigned long now = jiffies;
> +
> +       if (now != zswap_shrinker_delay_start)
> +               zswap_shrinker_delay_start = now;
> +}

Hmmm is there a reason why we do not just do:

zswap_shrinker_delay_start = jiffies;

or, more unnecessarily:

unsigned long now = jiffies;

zswap_shrinker_delay_start = now;

IOW, why the branching here? It does not seem necessary to me, but
perhaps there is a correctness/compiler reason I'm not seeing?

In fact, if we go with the first version (the plain assignment), we
could just manually inline it.

Additionally/alternatively, I wonder if it would be more convenient to
do this at the mm/page_io.c zswap callsites, i.e. whenever
zswap_store() or zswap_load() returns false, delay the shrinker before
proceeding with the IO steps.

> +
>  /*********************************
>  * pool functions
>  **********************************/
> @@ -1378,6 +1396,8 @@ static void shrink_worker(struct work_struct *w)
>         struct mem_cgroup *memcg;
>         int ret, failures = 0, progress;
>         unsigned long thr;
> +       unsigned long now, sleepuntil;
> +       const unsigned long delay = msecs_to_jiffies(ZSWAP_GLOBAL_SHRINKER_DELAY_MS);
>
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
> @@ -1405,6 +1425,21 @@ static void shrink_worker(struct work_struct *w)
>          * until the next run of shrink_worker().
>          */
>         do {
> +               /*
> +                * delay shrinking to allow the last rejected page to
> +                * complete its writeback
> +                */
> +               sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);

I assume we do not care about racy accesses here, right? The same goes
for the updates - I don't see any locks protecting these operations
(but I could be missing something).


> +               now = jiffies;
> +               /*
> +                * If zswap has not rejected pages for a long time,
> +                * sleepuntil - now may underflow.  We assume the timestamp
> +                * is valid only if now < sleepuntil < now + delay + 1
> +                */
> +               if (time_before(now, sleepuntil) &&
> +                               time_before(sleepuntil, now + delay + 1))
> +                       fsleep(jiffies_to_usecs(sleepuntil - now));
> +
>                 spin_lock(&zswap_shrink_lock);
>
>                 /*
> @@ -1526,8 +1561,10 @@ bool zswap_store(struct folio *folio)
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
>         /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       if (folio_test_large(folio)) {
> +               zswap_shrinker_delay_update();
>                 return false;
> +       }
>
>         if (!zswap_enabled)
>                 goto check_old;
> @@ -1648,6 +1685,8 @@ bool zswap_store(struct folio *folio)
>         zswap_entry_cache_free(entry);
>  reject:
>         obj_cgroup_put(objcg);
> +       zswap_shrinker_delay_update();
> +
>         if (need_global_shrink)
>                 queue_work(shrink_wq, &zswap_shrink_work);
>  check_old:
> @@ -1691,8 +1730,10 @@ bool zswap_load(struct folio *folio)
>         else
>                 entry = xa_load(tree, offset);
>
> -       if (!entry)
> +       if (!entry) {
> +               zswap_shrinker_delay_update();
>                 return false;
> +       }
>
>         if (entry->length)
>                 zswap_decompress(entry, page);
> @@ -1835,6 +1876,8 @@ static int zswap_setup(void)
>         if (ret)
>                 goto hp_fail;
>
> +       zswap_shrinker_delay_update();
> +
>         shrink_wq = alloc_workqueue("zswap-shrink",
>                         WQ_UNBOUND, 1);
>         if (!shrink_wq)
> --
> 2.43.0
>




