Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO

Nhat Pham <nphamcs@xxxxxxxxx> · Wed, 10 Jul 2024 15:10:00 -0700

On Wed, Jul 10, 2024 at 2:21 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> 2024年7月9日(火) 4:17 Nhat Pham <nphamcs@xxxxxxxxx>:
>
> >
> > Do you see this problem actually manifesting in real life? Does not
> > sound infeasible to me, but I wonder how likely this is the case.
> >
> > Do you have any userspace-visible metrics, or any tracing logs etc.
> > that proves that it actually happens?
> >
> > This might also affect the dynamic shrinker as well FWIW.
> >
>
> Although it is rare, on a small VM with 0.5GB RAM, performing `apt
> upgrade` for ubuntu kernel update degrades system responsiveness.
> Since kernel upgrade is memory consuming for zstd compressed
> initramfs, there is heavy memory pressure like the benchmark.
>
> Unfortunately I could not get evidence that clearly indicates the
> contention. Perhaps IO latency can be a metric?

IO latency, IO pressure, etc. I'm not sure either, as I do not have
the benchmark setup at hand - just throwing out ideas :)

> While allocating large memory, perf showed that __swap_writepage() was
> consuming time and was called mostly from kswapd and some fraction
> from user faults of python script and from shrink_worker. CPU was
> mostly idling even in a single CPU system, so lock contention and
> compression should not be the reason. I believe these behaviors
> suggest contention on writeback IO.

Sounds fair to me.

> As shown in the benchmark,  reducing shrinker writeback by patches 3
> to 6 reduced elapsed time, which also indicates IO contention.
>
> > > +/*
> > > + * To avoid IO contention between pagein/out and global shrinker writeback,
> > > + * track the last jiffies of pagein/out and delay the writeback.
> > > + * Default to 500msec in alignment with mq-deadline read timeout.
> >
> > If there is a future version, could you include the reason why you
> > select 500msec in the patch's changelog as well?
> >
>
> The 500ms can be any value longer than the average interval of each
> pageout/in and is not significant for behavior. If subsequent pageout
> rejection occurs while the shrinker is sleeping, writeback will be
> delayed again by 500ms from the last timestamp update. If pageout
> occurs at a 1ms interval on average, the minimal delay should be 1+ms.
>
> I chose 500ms from the mq-deadline scheduler that tries to perform
> read IO in a 500ms timeframe by default (bfq for HDD uses a shorter
> timeout).
> When the shrinker performs writeback IO with a 500ms delay from the
> last pagein, the write IO will be of lower priority than the read IO
> waiting in the queue, as the pagein read becomes the highest priority
> by the deadline. This logic emulates low-priority write IO by
> voluntarily delaying IO.
>
>
> >
> > Hmmm is there a reason why we do not just do:
> >
> > zswap_shrinker_delay_start = jiffies;
> >
> > or, more unnecessarily:
> >
> > unsigned long now = jiffies;
> >
> > zswap_shrinker_delay_start = now;
> >
> > IOW, why the branching here? Does not seem necessary to me, but
> > perhaps there is a correctness/compiler reason I'm not seeing?
> >
> > In fact, if it's the first version, then we could manually inline it.
> >
>
> That was to avoid invalidating the CPU cache of the shared variable
> unnecessarily. Removing the branch and manually inlining it for v3.

Ah I see. Seems a tad overengineering IMHO, but I'll leave this up to you.

If you keep the current version, please add a comment.

>
>
> > Additionally/alternatively, I wonder if it is more convenient to do it
> > at the mm/page_io.c zswap callsites, i.e whenever zswap_store() and
> > zswap_load() returns false, then delay the shrinker before proceeding
> > with the IO steps.
> >
>
> Should we expose the timestamp variable? It is only used in zswap
> internally, and the timestamp is not required when zswap is disabled.

Not necessarily the timestamp. If you keep the helper, you can just
expose that helper (and have a no-op when !CONFIG_ZSWAP in the zswap.h
header file).

>
> > >         do {
> > > +               /*
> > > +                * delay shrinking to allow the last rejected page completes
> > > +                * its writeback
> > > +                */
> > > +               sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
> >
> > I assume we do not care about racy access here right? Same goes for
> > updates - I don't see any locks protecting these operations (but I
> > could be missing something).
> >
>
> Right. Do we need atomic or spinlock for safety?
>  I think the bare store/load of unsigned long is sufficient here. The
> possible deviation by concurrent updates is mostly +/-1 jiffy. Sleep
> does not need ms accuracy.
>
> Ah, I found a mistake here. v2 missed continue statement in the loop.
> The delay should be extended if zswap_store() rejects another page. In
> v2, one writeback was allowed per 500ms, which was not my intended
> behavior.
> The corrected logic for v3 should be:
>
>                if (time_before(now, sleepuntil) &&
>                                time_before(sleepuntil, now + delay + 1)) {
>                        fsleep(jiffies_to_usecs(sleepuntil - now));
>                        /* check if subsequent pageout/in extended delay */
>                        continue;
>                }
>
>
> 2024年7月9日(火) 9:57 Nhat Pham <nphamcs@xxxxxxxxx>:
> >
> > Hmm what about this scenario: when we disable zswap writeback on a
> > cgroup, if zswap_store() fails, we are delaying the global shrinker
> > for no gain essentially. There is no subsequent IO. I don't think we
> > are currently handling this, right?
> >
> > >
> > > The same logic applies to zswap_load(). When zswap cannot find requested
> > > page from pool and read IO is performed, shrinker should be interrupted.
> > >
> >
> > Yet another (less concerning IMHO) scenario is when a cgroup disables
> > zswap by setting zswap.max = 0 (for instance, if the sysadmin knows
> > that this cgroup's data are really cold, and/or that the workload is
> > latency-tolerant, and do not want it to take up valuable memory
> > resources of other cgroups). Every time this cgroup reclaims memory,
> > it would disable the global shrinker (including the new proactive
> > behavior) for other cgroup, correct? And, when they do need to swap
> > in, it would further delay the global shrinker. Would this break of
> > isolation be a problem?
> >
> > There are other concerns I raised in the cover letter's response as
> > well - please take a look :)
>
> I haven't considered these cases much, but I suppose the global
> shrinker should be delayed in both cases as well. In general, any
> pagein/out should be prefered over shrinker writeback throughput.
>
> When zswap writeback was disabled for a memcg
> (memcg.zswap.writeback=0), I suppose disabling/delaying writeback is
> harmless.
> If the rejection incurs no IO, there is no more memory pressure and
> shrinking is not urgent. We can postpone the shrinker writeback. If
> the rejection incurs IO (i.e. mm choose another page from a memcg with
> writeback enabled), again we should delay the shrinker.

You are delaying writeback globally right? IOW, other cgroup is also affected?

Say we are under memory pressure, with two cgroups under reclaim - one
with zswap writeback disabled. The one with writeback disabled will
constantly fail at zswap_store(), and delay the global shrinking
action, which could have targeted the other cgroup (which should
proceed fully because there is no contention here since the first
cgroup is not swapping either).

>
> For pageout from latency-tolerant memcg (zswap.max=0), I think pageout
> latency may affect other memcgs.
> For example, when a latency-sensitive workload tries to allocate
> memory, mm might choose to swap out pages from zswap-disabled memcg.
> The slow direct pageout may indirectly delay allocation of the
> latency-sensitive workload. IOW, we cannot determine which workload
> would be blocked by a slow pageout based on which memcg owns the page.
> In this case, it would be better to delay shrinker writeback even if
> the owner is latency tolerant memcg.
> Also for pagein, we cannot determine how urgent the pagein is.
>
> Delaying shrinker on any pagein/out diminishes proactive shrinking
> progress, but that is still better than the existing shrinker that
> cannot shrink.

This is fair though :)