On Wed, Jun 19, 2024 at 6:03 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> Hello,
>
> Sorry for the late reply. I am currently investigating a
> responsiveness issue I found while benchmarking with this series,
> possibly related to concurrent zswap writeback and pageouts.
>
> This series cannot be applied until the root cause is identified,
> unfortunately. Thank you all for taking the time to review.
>
> The responsiveness issue was confirmed with 6.10-rc2 with all 3
> patches applied. Without patch 3, it still happens but is less likely.
>
> When allocating much larger memory than zswap can buffer, and
> writeback and rejection by pool_limit_hit happen simultaneously, the
> system stops responding. I do not see this freeze when zswap is
> disabled or when there is no pool_limit_hit. The proactive shrinking
> itself seems to work as expected as long as the writeback and pageout
> do not occur simultaneously.
>
> I suspect this issue exists in the current code but was not visible
> without this series, since the global shrinker did not write back a
> considerable amount of pages.
>
>
> On Sat, Jun 15, 2024 at 7:48 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > BTW, I'm curious. Have you experimented with increasing the pool size?
> > That 20% number is plenty for our use cases, but maybe yours need a
> > different cap?
> >
>
> Probably we can allocate a bit more zswap pool size. But that will
> keep more old pages once the pool limit is hit. If we can ensure no
> pool limit hits and zero writeback by allocating more memory, I will
> try the same amount of zramswap.
>
> > Also, have you experimented with the dynamic zswap shrinker? :) I'm
> > actually curious how it works out in the small machine regime, with
> > whatever workload you are running.
> >
>
> It seems the dynamic shrinker is trying to evict all pages. That does
> not fit my use case, which prefers balanced swap-in and swap-out
> performance.

Hmm, not quite. As you have noted earlier, it (tries to) shrink the
unprotected pages only.
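
To make "the unprotected pages only" concrete, here is a toy userspace
model (the helper name is made up, and this is not the code in
mm/zswap.c): only the part of the zswap LRU above the protected
watermark is reported to reclaim as scannable.

/*
 * Toy model of the point above -- not mm/zswap.c.  Protected entries
 * are simply subtracted from what the shrinker reports as reclaimable.
 */
#include <stdio.h>

static long toy_shrinker_count(long lru_size, long nr_protected)
{
	long reclaimable = lru_size - nr_protected;

	return reclaimable > 0 ? reclaimable : 0;
}

int main(void)
{
	/* 1000 entries on the zswap LRU, 250 of them currently protected */
	printf("scannable objects: %ld\n", toy_shrinker_count(1000, 250));
	return 0;
}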

>
> On Sat, Jun 15, 2024 at 9:20 Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > 1.
> > > The visible issue is that pageout/in operations from active processes
> > > are slow when zswap is near its max pool size. This is particularly
> > > significant on small memory systems, where total swap usage exceeds
> > > what zswap can store. This means that old pages occupy most of the
> > > zswap pool space, and recent pages use the swap disk directly.
> >
> > This should be a transient state though, right? Once the shrinker
> > kicks in it should write back the old pages and make space for the hot
> > ones. Which takes us to our next point.
> >
> > >
> > > 2.
> > > This issue is caused by zswap keeping the pool size near 100%. Since
> > > the shrinker fails to shrink the pool to accept_thr_percent and zswap
> > > rejects incoming pages, rejection occurs more frequently than it
> > > should. The rejected pages are directly written to disk while zswap
> > > protects old pages from eviction, leading to slow pageout/in
> > > performance for recent pages on the swap disk.
> >
> > Why is the shrinker failing? IIUC the first two patches fix two
> > cases where the shrinker stumbles upon offline memcgs, or memcgs with
> > no zswapped pages. Are these cases common enough in your use case that
> > every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> > putting the zswap usage below accept_thr_percent?
> >
> > This would be surprising given that we should be restarting the
> > shrinker with every swapout attempt until we can accept pages again.
> >
> > I guess one could construct a malicious case where there are some
> > sticky offline memcgs, and all the memcgs that actually have zswap
> > pages come after it in the iteration order.
> >
> > Could you shed more light about this? What does the setup look like?
> > How many memcgs are there, how many of them use zswap, and how many
> > offline memcgs are you observing?
> >
>
> Example from Ubuntu 22.04 using zswap:
> root@ctl:~# find /sys/fs/cgroup/ -wholename
> \*service/memory.zswap.current | xargs grep . | wc
>      31      31    2557
> root@ctl:~# find /sys/fs/cgroup/ -wholename
> \*service/memory.zswap.current | xargs grep ^0 | wc
>      11      11     911
>
> This indicates 11 out of 31 services have no pages in zswap. Without
> patch 2, shrink_worker() aborts shrinking in the second tree walk,
> before evicting about 40 pages from the services. The number varies,
> but I think it is common to see a few memcgs that have no zswap pages.
>
> > I am not saying we shouldn't fix these problems anyway, I am just
> > trying to understand how we got into this situation to begin with.
> >
> > >
> > > 3.
> > > If the pool size were shrunk proactively, rejection by pool limit hits
> > > would be less likely. New incoming pages could be accepted as the pool
> > > gains some space in advance, while older pages are written back in the
> > > background. zswap would then be filled with recent pages, as expected
> > > in the LRU logic.
> >
> > I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> > from reclaim should be doing this sort of "proactive shrinking".
> >
> > I agree that the current hysteresis around accept_thr_percent is not
> > good enough, but I am surprised you are hitting the pool limit if the
> > shrinker is being run during reclaim.
> >
> > >
> > > Patches 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> > > Patch 3 makes zswap_store trigger the shrinker before reaching the max
> > > pool size. With this series, zswap will prepare some space to reduce
> > > the probability of the problematic pool_limit_hit situation, thus
> > > reducing slow reclaim and the page priority inversion against the LRU.
> > >
> > > 4.
> > > Once proactive shrinking reduces the pool size, pageouts complete
> > > instantly as long as the space prepared by shrinking can store the
> > > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > > accept_threshold_percent will improve active process performance.
> >
> > I agree that proactive shrinking is preferable to waiting until we hit
> > the pool limit, then stop taking in pages until the acceptance
> > threshold is reached. I am just trying to understand whether such a
> > proactive shrinking mechanism will be needed if the reclaim shrinker
> > for zswap is being used, and how the two would work together.
>
> For my workload, the dynamic shrinker (reclaim shrinker) is disabled.
> The proposed global shrinker and the existing dynamic shrinker are
> both proactive, but their goals are different.
>
> The global shrinker starts shrinking when the zswap pool exceeds
> accept_thr_percent + 1%, then stops when it reaches
> accept_thr_percent. Pages below accept_thr_percent are protected from
> shrinking.
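
If I am reading the series right, the control loop you are describing
is roughly the following toy userspace model (made-up helper names and
numbers, not the actual patch code; max_pages stands in for whatever
max_pool_percent works out to):

/*
 * Toy model of the proposed hysteresis -- trigger above the acceptance
 * threshold + 1%, shrink back down to the acceptance threshold.
 */
#include <stdbool.h>
#include <stdio.h>

static const long max_pages = 100000;		/* pool cap in pages */
static const long accept_thr_percent = 90;	/* accept_threshold_percent */

static long accept_thr_pages(void)
{
	return max_pages * accept_thr_percent / 100;
}

/* patch 3: zswap_store() wakes the worker before the pool is full */
static bool should_wake_global_shrinker(long pool_pages)
{
	return pool_pages > accept_thr_pages() + max_pages / 100;
}

/* patches 1+2: the worker keeps writing back until the goal is reached */
static long global_shrink_worker(long pool_pages)
{
	while (pool_pages > accept_thr_pages())
		pool_pages--;	/* one page written back to the swap device */
	return pool_pages;
}

int main(void)
{
	long pool = 95000;	/* 95% full, above the 91% trigger point */

	if (should_wake_global_shrinker(pool))
		pool = global_shrink_worker(pool);
	printf("pool after shrink: %ld (goal: %ld)\n", pool, accept_thr_pages());
	return 0;
}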

> The dynamic shrinker starts shrinking based on memory pressure
> regardless of the zswap pool size, and stops when the LRU size is
> reduced to 1/4. Its goal is to wipe out all pages from zswap. It
> prefers swapout performance only.
>
> I think the current LRU logic decreases nr_zswap_protected too quickly
> for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to
> between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin()
> increments nr_zswap_protected when page-ins of evicted pages occur
> later, this technically has no effect while reclaim is in progress.
>
> While zswap_store() and zswap_lru_add() are called, the dynamic
> shrinker is likely running due to the pressure. The dynamic shrinker
> reduces the LRU size to 1/4, and then a few subsequent zswap_store()
> calls reduce the protected count to 1/4 of the LRU size. The stored
> pages will be reduced to zero through a few shrinker_scan calls.

Ah, this is a fair point. We've been observing this in
production/experiments as well - there seems to be a positive
correlation between the zswpout rate and the zswap_written_back rate.
Whenever there's a spike in zswpout, you also see a spike in
written-back pages - it looks like the flood of zswpouts weakens
zswap's LRU protection, which is not quite the intended effect.

We're working to improve this situation. We have a couple of ideas
floating around, none of which are too complicated to implement, but
they need experiments to validate before sending upstream :)
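
For reference, the decay we are both describing is roughly this (a
simplified userspace model of what I believe zswap_lru_add() does
around 6.10, not a verbatim copy of mm/zswap.c):

/*
 * Simplified model of the protection decay: each store grows the LRU
 * and, once the protected count exceeds lru_size / 4, halves it, so a
 * burst of zswpouts quickly erodes whatever protection
 * zswap_folio_swapin() has built up.
 */
#include <stdio.h>

int main(void)
{
	long lru_size = 1000;
	long nr_protected = 800;	/* built up by earlier swapins */

	for (int store = 0; store < 8; store++) {
		lru_size++;			/* new entry from zswap_store() */
		if (nr_protected > lru_size / 4)
			nr_protected /= 2;	/* decay applied on LRU add */
		printf("store %d: lru=%ld protected=%ld\n",
		       store, lru_size, nr_protected);
	}
	/* the protected count ends up between lru_size/8 and lru_size/4 */
	return 0;
}

A short burst of stores pins the protected count to roughly a quarter
of the LRU size or less, which matches what you are seeing.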