On Mon, Jul 29, 2024 at 4:07 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > The current zswap shrinker's heuristics to prevent overshrinking are
> > > brittle and inaccurate, specifically in the way we decay the protection
> > > size (i.e., making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
>
> That is certainly the intention :) I'm not a huge fan of those
> heuristics either - they seemed fairly brittle and arbitrary to me even
> back then (as you have pointed out).
>
> > >
> > > We currently decay the protection aggressively in zswap_lru_add()
> > > calls. This leads to the following unfortunate effect: when a new
> > > batch of pages enters zswap, the protection size rapidly decays to
> > > below 25% of the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the
> > > zswap shrinker: the rate of shrinking shoots up massively right after a
> > > new batch of zswap stores. This is somewhat the opposite of what we
> > > originally want - when new pages enter zswap, we want to protect both
> > > these new pages AND the pages that are already protected in the zswap
> > > LRU.
> > >
> > > Replace the existing heuristics with a second chance algorithm:
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference
> > >    bit is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference
> > >    bit set, give it a second chance - only flip the reference bit and
> > >    rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At first look, this is similar to the reclaim algorithm. A
> > fundamental difference here is that the reference bit is only set
> > once, when the entry is created. It is different from the conventional
> > second chance page reclaim/replacement algorithm.
>
> I suppose, yeah. We no longer have non-exclusive loading (in most
> scenarios), so the reference bit is only set once, on the store attempt.
>
> The interpretation/justification is still there though (somewhat). We
> are still giving each object another "chance" to stay in the zswap
> pool, in case it is needed soon, before it is reclaimed.
>
> > Taking a step back, what we really want is to write back zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seeks = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
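
(As an aside, to put numbers on the above: for a seeks == 2 shrinker
like zswap's, the do_shrink_slab() math roughly boils down to the
following. This is just a simplified model of the formula quoted
above, not the actual kernel code.)

static unsigned long model_total_scan(unsigned long freeable,
                                      unsigned long nr_deferred,
                                      int priority)
{
        /* seeks == 2: new scan work is (freeable * 4 / seeks) >> priority */
        unsigned long delta = (freeable * 2) >> priority;
        /* deferred work from previous invocations is scaled the same way */
        unsigned long total_scan = (nr_deferred >> priority) + delta;

        /* clamped to twice the freeable count reported by the shrinker */
        return total_scan < 2 * freeable ? total_scan : 2 * freeable;
}

With freeable = 1M entries and no deferred work, that is ~512 entries
(freeable/2048) at DEF_PRIORITY (12), ~128k (freeable/8) at
NODE_RECLAIM_PRIORITY (4), and the 2M cap at priority 0 - matching the
numbers above.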
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries, it subtracts the protected part (or recent swapins)
>
> Note that this "protection" decays rather aggressively with zswap
> stores. From zswap_lru_add():
>
> new = old > lru_size / 4 ? old / 2 : old;
>
> IOW, if the protection size exceeds 25% of lru_size, halve it. A
> couple of zswap_store() calls could easily slash this to 25% of the
> LRU (or even below).
>
> I guess I can fiddle with this decaying attempt + decouple the
> decaying from storing (i.e., move it somewhere else). But any formula
> I can come up with (another multiplicative factor?) seems as arbitrary
> as this one, and any place where I put the decaying could potentially
> couple two actions, which leads to unintended effects.
>
> The second chance algorithm creates a two-section LRU configuration
> similar to the old scheme - protected/unprotected (or, analogous to
> the page reclaim algorithm, active/inactive). However, it bypasses the
> need for such an artificially constructed formula, which you can think
> of as the rate of object aging. It naturally ties the aging of objects
> in the zswap pool (i.e., moving objects from one section to the other)
> to memory pressure, which makes sense to me, FWIW.
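
(To make that concrete, the second chance check in the shrinker's LRU
walk callback boils down to something like the sketch below. Purely
illustrative - the struct and helper names here are made up, and the
real code obviously has to deal with locking, writeback failures, etc.)

/* Reference bit, set once when the entry is stored into the pool. */
struct zswap_entry_sketch {
        struct list_head lru;
        bool referenced;
};

static enum lru_status second_chance_cb(struct zswap_entry_sketch *entry)
{
        if (entry->referenced) {
                /*
                 * First shrinker encounter since the store: give the entry
                 * a second chance - clear the bit and rotate it in the LRU.
                 */
                entry->referenced = false;
                return LRU_ROTATE;
        }

        /* Second encounter, bit already clear: write the entry back. */
        writeback_entry(entry);         /* made-up helper name */
        return LRU_REMOVED;
}

The point is that an entry has to survive one full pass of the shrinker
before it becomes reclaimable, so the rate at which objects age out of
the protected section is set by how hard the shrinker is being pushed,
i.e., by memory pressure.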
> > and scales by the compression ratio. So this looks reasonable at first
> > sight, perhaps we want to tune the seeks value to slow down writeback
> > if we think it's too much, but that doesn't explain the scenario you
> > are describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. This does not
> > apply to zswap, because the objects can be swapped in.
> >
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and write back this
> > amount, we will keep carrying this value forward in subsequent reclaim
> > operations, even if the number of existing zswap entries has decreased
> > due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps going forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
>
> Interesting theory. I wonder if I can do some tracing to investigate
> this (or maybe just spam trace_printk() lol).

OK, I found this tracepoint:

sudo echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable

I'll give this a shot, and report any interesting findings in the data
collected :)

> But yeah, this nr_deferred part seems suspicious. The number of
> freeable objects returned by the shrinker_count() function (at least
> the one implemented by zswap) doesn't really track which objects are
> newly freeable since the last shrinker_count() invocation. So an
> object can be accounted for in a previous invocation, fail to be
> reclaimed in that invocation, yet still count towards future shrinker
> invocations? That sounds wrong :) Even without zswapin, that's still
> double counting, no? A future shrinker_count() call will still
> consider the previously-failed-to-reclaim object. Do other shrinkers
> somehow track the objects that have been accounted for in previous
> shrinker attempts, and use this in the freeable computation?
>
> That said, how bad it is in practice depends on the rate of reclaim
> failure, since nr_deferred only increments when we fail to scan. That
> rate is actually not very high - for instance, these are the
> statistics from a couple of my benchmark runs (using the old zswap
> shrinker protection scheme, i.e., the status quo):
>
> /sys/kernel/debug/zswap/reject_reclaim_fail:292
> /sys/kernel/debug/zswap/written_back_pages:6947364
>
> IIRC, exclusive loading really lowers the number of reclaim rejections
> (you're less likely to observe a page that is already loaded into
> memory and in the swap cache - its zswap entry is already
> invalidated).

Actually, on closer inspection, we also increase the deferred count
when we short-circuit in each of the following scenarios:

a) When we disable the zswap shrinker or zswap writeback in general. I
don't think this matters too much in practice - we're not reclaiming
here anyway :)

b) When we detect that there are fewer unprotected objects than the
batch size.

FWIW, the second case also goes away with the new scheme :)

> This could be a problem, and probably should be fixed (for instance,
> by adding a no-defer mode for shrinkers similar to zswap), but there
> seems to be an additional/orthogonal reason for overshrinking.

I can also just implement this and benchmark it :)
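
(If we do go the no-defer route, I imagine it could be as simple as a
new shrinker flag consulted where do_shrink_slab() accumulates the
leftover work - a purely hypothetical sketch, the flag does not exist
today and the surrounding variable names are from memory:)

        /* in do_shrink_slab(), after the scan loop: */
        if (shrinker->flags & SHRINKER_NO_DEFER) {
                /*
                 * Hypothetical opt-out for shrinkers whose objects can
                 * disappear on their own (e.g. zswap entries freed by
                 * swapin) - don't carry unscanned work forward.
                 */
                next_deferred = 0;
        } else {
                next_deferred = max_t(long, nr + delta - scanned, 0);
        }
        add_nr_deferred(next_deferred, shrinker, shrinkctl);

For zswap that would mean every shrink invocation works purely off a
fresh count, at the cost of losing the deferral behavior that other
shrinkers presumably rely on.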