Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO

Takero Funaki <flintglass@xxxxxxxxx> · Thu, 11 Jul 2024 06:21:09 +0900

2024年7月9日(火) 4:17 Nhat Pham <nphamcs@xxxxxxxxx>:

>
> Do you see this problem actually manifesting in real life? Does not
> sound infeasible to me, but I wonder how likely this is the case.
>
> Do you have any userspace-visible metrics, or any tracing logs etc.
> that proves that it actually happens?
>
> This might also affect the dynamic shrinker as well FWIW.
>

Although it is rare, on a small VM with 0.5GB RAM, performing `apt
upgrade` for ubuntu kernel update degrades system responsiveness.
Since kernel upgrade is memory consuming for zstd compressed
initramfs, there is heavy memory pressure like the benchmark.

Unfortunately I could not get evidence that clearly indicates the
contention. Perhaps IO latency can be a metric?
While allocating large memory, perf showed that __swap_writepage() was
consuming time and was called mostly from kswapd and some fraction
from user faults of python script and from shrink_worker. CPU was
mostly idling even in a single CPU system, so lock contention and
compression should not be the reason. I believe these behaviors
suggest contention on writeback IO.
As shown in the benchmark,  reducing shrinker writeback by patches 3
to 6 reduced elapsed time, which also indicates IO contention.

> > +/*
> > + * To avoid IO contention between pagein/out and global shrinker writeback,
> > + * track the last jiffies of pagein/out and delay the writeback.
> > + * Default to 500msec in alignment with mq-deadline read timeout.
>
> If there is a future version, could you include the reason why you
> select 500msec in the patch's changelog as well?
>

The 500ms can be any value longer than the average interval of each
pageout/in and is not significant for behavior. If subsequent pageout
rejection occurs while the shrinker is sleeping, writeback will be
delayed again by 500ms from the last timestamp update. If pageout
occurs at a 1ms interval on average, the minimal delay should be 1+ms.

I chose 500ms from the mq-deadline scheduler that tries to perform
read IO in a 500ms timeframe by default (bfq for HDD uses a shorter
timeout).
When the shrinker performs writeback IO with a 500ms delay from the
last pagein, the write IO will be of lower priority than the read IO
waiting in the queue, as the pagein read becomes the highest priority
by the deadline. This logic emulates low-priority write IO by
voluntarily delaying IO.

>
> Hmmm is there a reason why we do not just do:
>
> zswap_shrinker_delay_start = jiffies;
>
> or, more unnecessarily:
>
> unsigned long now = jiffies;
>
> zswap_shrinker_delay_start = now;
>
> IOW, why the branching here? Does not seem necessary to me, but
> perhaps there is a correctness/compiler reason I'm not seeing?
>
> In fact, if it's the first version, then we could manually inline it.
>

That was to avoid invalidating the CPU cache of the shared variable
unnecessarily. Removing the branch and manually inlining it for v3.

> Additionally/alternatively, I wonder if it is more convenient to do it
> at the mm/page_io.c zswap callsites, i.e whenever zswap_store() and
> zswap_load() returns false, then delay the shrinker before proceeding
> with the IO steps.
>

Should we expose the timestamp variable? It is only used in zswap
internally, and the timestamp is not required when zswap is disabled.

> >         do {
> > +               /*
> > +                * delay shrinking to allow the last rejected page completes
> > +                * its writeback
> > +                */
> > +               sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
>
> I assume we do not care about racy access here right? Same goes for
> updates - I don't see any locks protecting these operations (but I
> could be missing something).
>

Right. Do we need atomic or spinlock for safety?
 I think the bare store/load of unsigned long is sufficient here. The
possible deviation by concurrent updates is mostly +/-1 jiffy. Sleep
does not need ms accuracy.

Ah, I found a mistake here. v2 missed continue statement in the loop.
The delay should be extended if zswap_store() rejects another page. In
v2, one writeback was allowed per 500ms, which was not my intended
behavior.
The corrected logic for v3 should be:

               if (time_before(now, sleepuntil) &&
                               time_before(sleepuntil, now + delay + 1)) {
                       fsleep(jiffies_to_usecs(sleepuntil - now));
                       /* check if subsequent pageout/in extended delay */
                       continue;
               }

2024年7月9日(火) 9:57 Nhat Pham <nphamcs@xxxxxxxxx>:
>
> Hmm what about this scenario: when we disable zswap writeback on a
> cgroup, if zswap_store() fails, we are delaying the global shrinker
> for no gain essentially. There is no subsequent IO. I don't think we
> are currently handling this, right?
>
> >
> > The same logic applies to zswap_load(). When zswap cannot find requested
> > page from pool and read IO is performed, shrinker should be interrupted.
> >
>
> Yet another (less concerning IMHO) scenario is when a cgroup disables
> zswap by setting zswap.max = 0 (for instance, if the sysadmin knows
> that this cgroup's data are really cold, and/or that the workload is
> latency-tolerant, and do not want it to take up valuable memory
> resources of other cgroups). Every time this cgroup reclaims memory,
> it would disable the global shrinker (including the new proactive
> behavior) for other cgroup, correct? And, when they do need to swap
> in, it would further delay the global shrinker. Would this break of
> isolation be a problem?
>
> There are other concerns I raised in the cover letter's response as
> well - please take a look :)

I haven't considered these cases much, but I suppose the global
shrinker should be delayed in both cases as well. In general, any
pagein/out should be prefered over shrinker writeback throughput.

When zswap writeback was disabled for a memcg
(memcg.zswap.writeback=0), I suppose disabling/delaying writeback is
harmless.
If the rejection incurs no IO, there is no more memory pressure and
shrinking is not urgent. We can postpone the shrinker writeback. If
the rejection incurs IO (i.e. mm choose another page from a memcg with
writeback enabled), again we should delay the shrinker.

For pageout from latency-tolerant memcg (zswap.max=0), I think pageout
latency may affect other memcgs.
For example, when a latency-sensitive workload tries to allocate
memory, mm might choose to swap out pages from zswap-disabled memcg.
The slow direct pageout may indirectly delay allocation of the
latency-sensitive workload. IOW, we cannot determine which workload
would be blocked by a slow pageout based on which memcg owns the page.
In this case, it would be better to delay shrinker writeback even if
the owner is latency tolerant memcg.
Also for pagein, we cannot determine how urgent the pagein is.

Delaying shrinker on any pagein/out diminishes proactive shrinking
progress, but that is still better than the existing shrinker that
cannot shrink.