On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote: > > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote: > > > > Current zswap shrinker's heursitics to prevent overshrinking is brittle > > and inaccurate, specifically in the way we decay the protection size > > (i.e making pages in the zswap LRU eligible for reclaim). > > Thanks for working on this and experimenting with different > heuristics. I was not a huge fan of these, so I am glad we are trying > to replace them with something more intuitive. That is certainly the intention :) I'm not a huge fan of those heuristics either - they seem fairly brittle and arbitrary to me even back then (as you have pointed out). > > > > > We currently decay protection aggressively in zswap_lru_add() calls. > > This leads to the following unfortunate effect: when a new batch of > > pages enter zswap, the protection size rapidly decays to below 25% of > > the zswap LRU size, which is way too low. > > > > We have observed this effect in production, when experimenting with the > > zswap shrinker: the rate of shrinking shoots up massively right after a > > new batch of zswap stores. This is somewhat the opposite of what we want > > originally - when new pages enter zswap, we want to protect both these > > new pages AND the pages that are already protected in the zswap LRU. > > > > Replace existing heuristics with a second chance algorithm > > > > 1. When a new zswap entry is stored in the zswap pool, its reference bit > > is set. > > 2. When the zswap shrinker encounters a zswap entry with the reference > > bit set, give it a second chance - only flips the reference bit and > > rotate it in the LRU. > > 3. If the shrinker encounters the entry again, this time with its > > reference bit unset, then it can reclaim the entry. > > At the first look, this is similar to the reclaim algorithm. A > fundamental difference here is that the reference bit is only set > once, when the entry is created. It is different from the conventional > second chance page reclaim/replacement algorithm. I suppose, yeah. We no longer have non-exclusive loading (in most scenarios), so the reference bit is only set once, on store attempt. The interpretation/justification is still there though (somewhat). We are still giving each object another "chance" to stay in the zswap pool, in case it is needed soon, before it is reclaimed. > Taking a step back, what we really want is to writeback zswap entries > in order, and avoid writing back more entries than needed. I think the > key here is "when needed", which is defined by how much memory > pressure we have. The shrinker framework should already be taking this > into account. > > Looking at do_shrink_slab(), in the case of zswap (seek = 2), > total_scan should boil down to: > > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority > > , and this is bounded by zswap_shrinker_count() * 2. > > Ignoring nr_deferred, we start by scanning 1/2048th of > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 * > zswap_shrinker_count() at zero priority (before OOMs). At > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count(). > > Keep in mind that zswap_shrinker_count() does not return the number of > all zswap entries, it subtracts the protected part (or recent swapins) Note that this "protection" decays rather aggressively with zswap stores. From zswap_lru_add(): new = old > lru_size / 4 ? old / 2 : old; IOW, if the protection size exceeds 25% lru_size, half it. A couple of zswap_stores() could easily slash this to 25% of the LRU (or even below) rapidly. I guess I can fiddle with this decaying attempt + decouple the decaying from storing (i.e moving it somewhere else). But any formula I can come up with (another multiplicative factor?) seems as arbitrary as this formula, and any places that I place the decaying could potentially couple two actions, which lead to unintended effect. The second chance algorithm creates a similar two-section LRU configuration as the old scheme - protected/unprotected (or, analogous to the page reclaim algorithm, active/inactive). However, it bypasses the need for such an artificially constructed formula, which you can think of as the rate of objects aging. It naturally ties the aging of objects (i.e moving the objects from one section to another) in the zswap pool with memory pressure, which makes sense to me, FWIW. > and scales by the compression ratio. So this looks reasonable at first > sight, perhaps we want to tune the seek to slow down writeback if we > think it's too much, but that doesn't explain the scenario you are > describing. > > Now let's factor in nr_deferred, which looks to me like it could be > the culprit here. I am assuming the intention is that if we counted > freeable slab objects before but didn't get to free them, we should do > it the next time around. This feels like it assumes that the objects > will remain there unless reclaimed by the shrinker. This does not > apply for zswap, because the objects can be swapped in. > > Also, in the beginning, before we encounter too many swapins, the > protection will be very low, so zswap_shrinker_count() will return a > relatively high value. Even if we don't scan and writeback this > amount, we will keep carrying this value forward in next reclaim > operations, even if the number of existing zswap entries have > decreased due to swapins. > > Could this be the problem? The number of deferred objects to be > scanned just keeps going forward as a high value, essentially > rendering the heuristics in zswap_shrinker_count() useless? Interesting theory. I wonder if I can do some tracing to investigate this (or maybe just spam trace_printk() lol). But yeah, this nr_deferred part seems suspicious. The number of freeable objects returned by the shrinker_count() function (at least the one implemented by zswap) doesn't really track which object is newly freeable since the last shrinker_count() invocation. So an object can be accounted for in a previous invocation, fail to be reclaimed in that invocation yet still count towards future shrinker invocation? That sounds wrong :) Even without zswapin, that's still double counting, no? Future shrinker_count() call will still consider the previously-failed-to-reclaim object. Do other shrinkers somehow track the objects that have been accounted for in previous shrinker attempts, and use this in the freeable computation? That said, how bad it is in practice depends on the rate of reclaim failure, since the nr_deferred only increments when we fail to scan. Which is actually not very high - for instance, this is the statistics from a couple of my benchmark runs (using the old zswap shrinker protection scheme, i.e the status quo): /sys/kernel/debug/zswap/reject_reclaim_fail:292 /sys/kernel/debug/zswap/written_back_pages:6947364 IIRC, exclusive loading really lowers the number of reclaim rejection (you're less likely to observe a page that is already loaded into memory and in swap cache - its zswap entry is already invalidated). This could be a problem, and probably should be fixed (for instance, by adding a no-defer mode for shrinkers similar to zswap), but there seems to be an additional/orthogonal reason for overshrinking. > > If we just need to slow down writeback by making sure we scan entries > twice, could something similar be achieved just by tuning the seek > without needing any heuristics to begin with? Ah but this is a good point. My only defence for not using seek would be I have no clue how to interpret the seek value :) Based on the documentation: int seeks; /* seeks to recreate an obj */ So I'm guessing this sorta kinda represents the number of disk seeks (or IO?) required to reconstruct the object. Which makes sense in the special case of 0. But it seems really arbitrary otherwise - DEFAULT_SEEKS is 2, but zswap does not really take 2 seeks/IOs to reconstruct the object, no? It might have been different in the past - I don't know the historical context here. But now, I guess this is just a shorthand for writeback pace. The second chance algorithm has a (nominal) justification that backs it, even though in effect it's the same. But yeah, if we only think about actual effects, perhaps we don't even need the reference bit manipulation - just halving the writeback pace by half is good enough. There might be some differences with non-exclusive loading, but now that's gone (for the most part). The other difference I can think of is in the aging-reclaiming ordering: 1. With the reference bit scheme, shrinkers will age the entire LRU before reclaiming anything (due to LRU rotation). 2. WIth increasing seek value, we will start reclaiming right away (albeit half of our old pace). Average reclaim rate should stay the same, but the former gives the objects on the zswap pool more chance to be loaded into main memory at the beginning. Or maybe I'm overthinking this and it doesn't really matter - benchmarking time? :) > > I am just trying to understand if the main problem is that zswap does > not fit well into the shrinker framework as it is, and how we can > improve this. Agree. Another potential problem is under highly concurrent settings, we can have many shrinker instances hitting the same zswap LRU. And since the protection mechanism is racy/best-effort at best, we can easily write back well into the protection area. I was seriously considering physically separating the protected (active) and unprotected (inactive) LRUs for zswap, but the second chance algorithm seems a natural way to implement this separation, and plays quite well with the current shrinker framework - aging and reclaiming are all serialized via the lru lock. > Just to be clear, I am in favor of changing those heuristics to > something more intuitive and simpler, but I really want to understand > what is going on. The approach taken by this patch is definitely > simpler, but it doesn't feel more intuitive to me (at least not yet). Thanks, yeah I agree. I was actually thinking that the second chance algorithm would be a bit more natural than some protection decaying formulae I pull out of thin air, but perhaps there are still problems with it. > If we take this approach, this needs to be placed in the hole after > the length, to avoid increasing the size of the zswap_entry. Very good point - I was actually thinking about this the other day, but somehow forgot to fix it before submission :) I'll fix this.