On Mon, Jul 29, 2024 at 4:07 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > The current zswap shrinker's heuristics to prevent overshrinking are
> > > brittle and inaccurate, specifically in the way we decay the protection
> > > size (i.e., making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
>
> That is certainly the intention :) I'm not a huge fan of those
> heuristics either - they seemed fairly brittle and arbitrary to me even
> back then (as you have pointed out).
>
> > >
> > > We currently decay the protection aggressively in zswap_lru_add()
> > > calls. This leads to the following unfortunate effect: when a new
> > > batch of pages enters zswap, the protection size rapidly decays to
> > > below 25% of the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the
> > > zswap shrinker: the rate of shrinking shoots up massively right after a
> > > new batch of zswap stores. This is somewhat the opposite of what we
> > > originally want - when new pages enter zswap, we want to protect both
> > > these new pages AND the pages that are already protected in the zswap
> > > LRU.
> > >
> > > Replace the existing heuristics with a second chance algorithm:
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference
> > >    bit is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference
> > >    bit set, give it a second chance - only flip the reference bit and
> > >    rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > >    reference bit unset, then it can reclaim the entry.
> >
> > At first look, this is similar to the reclaim algorithm. A
> > fundamental difference here is that the reference bit is only set
> > once, when the entry is created. It is different from the conventional
> > second chance page reclaim/replacement algorithm.
>
> I suppose, yeah. We no longer have non-exclusive loading (in most
> scenarios), so the reference bit is only set once, on the store attempt.
>
> The interpretation/justification is still there though (somewhat). We
> are still giving each object another "chance" to stay in the zswap
> pool, in case it is needed soon, before it is reclaimed.
>
> > Taking a step back, what we really want is to write back zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seeks = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
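
(As an aside, to put numbers on the above: for a seeks == 2 shrinker
like zswap's, the do_shrink_slab() math roughly boils down to the
following. This is just a simplified model of the formula quoted
above, not the actual kernel code.)

static unsigned long model_total_scan(unsigned long freeable,
                                      unsigned long nr_deferred,
                                      int priority)
{
        /* seeks == 2: new scan work is (freeable * 4 / seeks) >> priority */
        unsigned long delta = (freeable * 2) >> priority;
        /* deferred work from previous invocations is scaled the same way */
        unsigned long total_scan = (nr_deferred >> priority) + delta;

        /* clamped to twice the freeable count reported by the shrinker */
        return total_scan < 2 * freeable ? total_scan : 2 * freeable;
}

With freeable = 1M entries and no deferred work, that is ~512 entries
(freeable/2048) at DEF_PRIORITY (12), ~128k (freeable/8) at
NODE_RECLAIM_PRIORITY (4), and the 2M cap at priority 0 - matching the
numbers above.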
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries, it subtracts the protected part (or recent swapins)
>
> Note that this "protection" decays rather aggressively with zswap
> stores. From zswap_lru_add():
>
> new = old > lru_size / 4 ? old / 2 : old;
>
> IOW, if the protection size exceeds 25% of lru_size, halve it. A
> couple of zswap_store() calls could easily slash this to 25% of the
> LRU (or even below).
>
> I guess I can fiddle with this decaying attempt + decouple the
> decaying from storing (i.e., move it somewhere else). But any formula
> I can come up with (another multiplicative factor?) seems as arbitrary
> as this one, and any place where I put the decaying could potentially
> couple two actions, which leads to unintended effects.
>
> The second chance algorithm creates a two-section LRU configuration
> similar to the old scheme - protected/unprotected (or, analogous to
> the page reclaim algorithm, active/inactive). However, it bypasses the
> need for such an artificially constructed formula, which you can think
> of as the rate of object aging. It naturally ties the aging of objects
> in the zswap pool (i.e., moving objects from one section to the other)
> to memory pressure, which makes sense to me, FWIW.
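
(To make that concrete, the second chance check in the shrinker's LRU
walk callback boils down to something like the sketch below. Purely
illustrative - the struct and helper names here are made up, and the
real code obviously has to deal with locking, writeback failures, etc.)

/* Reference bit, set once when the entry is stored into the pool. */
struct zswap_entry_sketch {
        struct list_head lru;
        bool referenced;
};

static enum lru_status second_chance_cb(struct zswap_entry_sketch *entry)
{
        if (entry->referenced) {
                /*
                 * First shrinker encounter since the store: give the entry
                 * a second chance - clear the bit and rotate it in the LRU.
                 */
                entry->referenced = false;
                return LRU_ROTATE;
        }

        /* Second encounter, bit already clear: write the entry back. */
        writeback_entry(entry);         /* made-up helper name */
        return LRU_REMOVED;
}

The point is that an entry has to survive one full pass of the shrinker
before it becomes reclaimable, so the rate at which objects age out of
the protected section is set by how hard the shrinker is being pushed,
i.e., by memory pressure.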
> > and scales by the compression ratio. So this looks reasonable at first
> > sight, perhaps we want to tune the seeks value to slow down writeback
> > if we think it's too much, but that doesn't explain the scenario you
> > are describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. This does not
> > apply to zswap, because the objects can be swapped in.
> >
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and write back this
> > amount, we will keep carrying this value forward in subsequent reclaim
> > operations, even if the number of existing zswap entries has decreased
> > due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps going forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
>
> Interesting theory. I wonder if I can do some tracing to investigate
> this (or maybe just spam trace_printk() lol).

OK, I found this tracepoint:

sudo echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable

I'll give this a shot, and report any interesting findings in the data
collected :)

> But yeah, this nr_deferred part seems suspicious. The number of
> freeable objects returned by the shrinker_count() function (at least
> the one implemented by zswap) doesn't really track which objects are
> newly freeable since the last shrinker_count() invocation. So an
> object can be accounted for in a previous invocation, fail to be
> reclaimed in that invocation, yet still count towards future shrinker
> invocations? That sounds wrong :) Even without zswapin, that's still
> double counting, no? A future shrinker_count() call will still
> consider the previously-failed-to-reclaim object. Do other shrinkers
> somehow track the objects that have been accounted for in previous
> shrinker attempts, and use this in the freeable computation?
>
> That said, how bad it is in practice depends on the rate of reclaim
> failure, since nr_deferred only increments when we fail to scan. That
> rate is actually not very high - for instance, these are the
> statistics from a couple of my benchmark runs (using the old zswap
> shrinker protection scheme, i.e., the status quo):
>
> /sys/kernel/debug/zswap/reject_reclaim_fail:292
> /sys/kernel/debug/zswap/written_back_pages:6947364
>
> IIRC, exclusive loading really lowers the number of reclaim rejections
> (you're less likely to observe a page that is already loaded into
> memory and in the swap cache - its zswap entry is already
> invalidated).

Actually, on closer inspection, we also increase the deferred count
when we short-circuit in each of the following scenarios:

a) When we disable the zswap shrinker or zswap writeback in general. I
don't think this matters too much in practice - we're not reclaiming
here anyway :)

b) When we detect that there are fewer unprotected objects than the
batch size.

FWIW, the second case also goes away with the new scheme :)

> This could be a problem, and probably should be fixed (for instance,
> by adding a no-defer mode for shrinkers similar to zswap), but there
> seems to be an additional/orthogonal reason for overshrinking.

I can also just implement this and benchmark it :)
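
(If we do go the no-defer route, I imagine it could be as simple as a
new shrinker flag consulted where do_shrink_slab() accumulates the
leftover work - a purely hypothetical sketch, the flag does not exist
today and the surrounding variable names are from memory:)

        /* in do_shrink_slab(), after the scan loop: */
        if (shrinker->flags & SHRINKER_NO_DEFER) {
                /*
                 * Hypothetical opt-out for shrinkers whose objects can
                 * disappear on their own (e.g. zswap entries freed by
                 * swapin) - don't carry unscanned work forward.
                 */
                next_deferred = 0;
        } else {
                next_deferred = max_t(long, nr + delta - scanned, 0);
        }
        add_nr_deferred(next_deferred, shrinker, shrinkctl);

For zswap that would mean every shrink invocation works purely off a
fresh count, at the cost of losing the deferral behavior that other
shrinkers presumably rely on.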