On Wed, Jun 19, 2024 at 6:03 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> Hello,
>
> Sorry for the late reply. I am currently investigating a
> responsiveness issue I found while benchmarking with this series,
> possibly related to concurrent zswap writeback and pageouts.
>
> This series cannot be applied until the root cause is identified,
> unfortunately. Thank you all for taking the time to review.
>
> The responsiveness issue was confirmed with 6.10-rc2 with all 3
> patches applied. Without patch 3, it still happens but is less likely.
>
> When allocating much larger memory than zswap can buffer, and
> writeback and rejection by pool_limit_hit happen simultaneously, the
> system stops responding. I do not see this freeze when zswap is
> disabled or when there is no pool_limit_hit. The proactive shrinking
> itself seems to work as expected as long as the writeback and pageout
> do not occur simultaneously.
>
> I suspect this issue exists in the current code but was not visible
> without this series, since the global shrinker did not write back a
> considerable amount of pages.
>
>
> On Sat, Jun 15, 2024 at 7:48 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > BTW, I'm curious. Have you experimented with increasing the pool size?
> > That 20% number is plenty for our use cases, but maybe yours need a
> > different cap?
> >
>
> Probably we can allocate a bit more zswap pool size. But that will
> keep more old pages once the pool limit is hit. If we can ensure no
> pool limit hits and zero writeback by allocating more memory, I will
> try the same amount of zramswap.
>
> > Also, have you experimented with the dynamic zswap shrinker? :) I'm
> > actually curious how it works out in the small machine regime, with
> > whatever workload you are running.
> >
>
> It seems the dynamic shrinker is trying to evict all pages. That does
> not fit my use case, which prefers balanced swap-in and swap-out
> performance.

Hmm, not quite. As you have noted earlier, it (tries to) shrink the
unprotected pages only.
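
To make "the unprotected pages only" concrete, here is a toy userspace
model (the helper name is made up, and this is not the code in
mm/zswap.c): only the part of the zswap LRU above the protected
watermark is reported to reclaim as scannable.

/*
 * Toy model of the point above -- not mm/zswap.c.  Protected entries
 * are simply subtracted from what the shrinker reports as reclaimable.
 */
#include <stdio.h>

static long toy_shrinker_count(long lru_size, long nr_protected)
{
	long reclaimable = lru_size - nr_protected;

	return reclaimable > 0 ? reclaimable : 0;
}

int main(void)
{
	/* 1000 entries on the zswap LRU, 250 of them currently protected */
	printf("scannable objects: %ld\n", toy_shrinker_count(1000, 250));
	return 0;
}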

>
> On Sat, Jun 15, 2024 at 9:20 Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > 1.
> > > The visible issue is that pageout/in operations from active processes
> > > are slow when zswap is near its max pool size. This is particularly
> > > significant on small memory systems, where total swap usage exceeds
> > > what zswap can store. This means that old pages occupy most of the
> > > zswap pool space, and recent pages use the swap disk directly.
> >
> > This should be a transient state though, right? Once the shrinker
> > kicks in it should write back the old pages and make space for the hot
> > ones. Which takes us to our next point.
> >
> > >
> > > 2.
> > > This issue is caused by zswap keeping the pool size near 100%. Since
> > > the shrinker fails to shrink the pool to accept_thr_percent and zswap
> > > rejects incoming pages, rejection occurs more frequently than it
> > > should. The rejected pages are directly written to disk while zswap
> > > protects old pages from eviction, leading to slow pageout/in
> > > performance for recent pages on the swap disk.
> >
> > Why is the shrinker failing? IIUC the first two patches fix two
> > cases where the shrinker stumbles upon offline memcgs, or memcgs with
> > no zswapped pages. Are these cases common enough in your use case that
> > every single time the shrinker runs it hits MAX_RECLAIM_RETRIES before
> > putting the zswap usage below accept_thr_percent?
> >
> > This would be surprising given that we should be restarting the
> > shrinker with every swapout attempt until we can accept pages again.
> >
> > I guess one could construct a malicious case where there are some
> > sticky offline memcgs, and all the memcgs that actually have zswap
> > pages come after it in the iteration order.
> >
> > Could you shed more light about this? What does the setup look like?
> > How many memcgs are there, how many of them use zswap, and how many
> > offline memcgs are you observing?
> >
>
> Example from Ubuntu 22.04 using zswap:
> root@ctl:~# find /sys/fs/cgroup/ -wholename
> \*service/memory.zswap.current | xargs grep . | wc
>      31      31    2557
> root@ctl:~# find /sys/fs/cgroup/ -wholename
> \*service/memory.zswap.current | xargs grep ^0 | wc
>      11      11     911
>
> This indicates 11 out of 31 services have no pages in zswap. Without
> patch 2, shrink_worker() aborts shrinking in the second tree walk,
> before evicting about 40 pages from the services. The number varies,
> but I think it is common to see a few memcgs that have no zswap pages.
>
> > I am not saying we shouldn't fix these problems anyway, I am just
> > trying to understand how we got into this situation to begin with.
> >
> > >
> > > 3.
> > > If the pool size were shrunk proactively, rejection by pool limit hits
> > > would be less likely. New incoming pages could be accepted as the pool
> > > gains some space in advance, while older pages are written back in the
> > > background. zswap would then be filled with recent pages, as expected
> > > in the LRU logic.
> >
> > I suspect if patches 1 and 2 fix your problem, the shrinker invoked
> > from reclaim should be doing this sort of "proactive shrinking".
> >
> > I agree that the current hysteresis around accept_thr_percent is not
> > good enough, but I am surprised you are hitting the pool limit if the
> > shrinker is being run during reclaim.
> >
> > >
> > > Patches 1 and 2 make the shrinker reduce the pool to accept_thr_percent.
> > > Patch 3 makes zswap_store trigger the shrinker before reaching the max
> > > pool size. With this series, zswap will prepare some space to reduce
> > > the probability of the problematic pool_limit_hit situation, thus
> > > reducing slow reclaim and the page priority inversion against the LRU.
> > >
> > > 4.
> > > Once proactive shrinking reduces the pool size, pageouts complete
> > > instantly as long as the space prepared by shrinking can store the
> > > direct reclaim. If an admin sees a large pool_limit_hit, lowering
> > > accept_threshold_percent will improve active process performance.
> >
> > I agree that proactive shrinking is preferable to waiting until we hit
> > the pool limit, then stop taking in pages until the acceptance
> > threshold is reached. I am just trying to understand whether such a
> > proactive shrinking mechanism will be needed if the reclaim shrinker
> > for zswap is being used, and how the two would work together.
>
> For my workload, the dynamic shrinker (reclaim shrinker) is disabled.
> The proposed global shrinker and the existing dynamic shrinker are
> both proactive, but their goals are different.
>
> The global shrinker starts shrinking when the zswap pool exceeds
> accept_thr_percent + 1%, then stops when it reaches
> accept_thr_percent. Pages below accept_thr_percent are protected from
> shrinking.
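
If I am reading the series right, the control loop you are describing
is roughly the following toy userspace model (made-up helper names and
numbers, not the actual patch code; max_pages stands in for whatever
max_pool_percent works out to):

/*
 * Toy model of the proposed hysteresis -- trigger above the acceptance
 * threshold + 1%, shrink back down to the acceptance threshold.
 */
#include <stdbool.h>
#include <stdio.h>

static const long max_pages = 100000;		/* pool cap in pages */
static const long accept_thr_percent = 90;	/* accept_threshold_percent */

static long accept_thr_pages(void)
{
	return max_pages * accept_thr_percent / 100;
}

/* patch 3: zswap_store() wakes the worker before the pool is full */
static bool should_wake_global_shrinker(long pool_pages)
{
	return pool_pages > accept_thr_pages() + max_pages / 100;
}

/* patches 1+2: the worker keeps writing back until the goal is reached */
static long global_shrink_worker(long pool_pages)
{
	while (pool_pages > accept_thr_pages())
		pool_pages--;	/* one page written back to the swap device */
	return pool_pages;
}

int main(void)
{
	long pool = 95000;	/* 95% full, above the 91% trigger point */

	if (should_wake_global_shrinker(pool))
		pool = global_shrink_worker(pool);
	printf("pool after shrink: %ld (goal: %ld)\n", pool, accept_thr_pages());
	return 0;
}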

> The dynamic shrinker starts shrinking based on memory pressure
> regardless of the zswap pool size, and stops when the LRU size is
> reduced to 1/4. Its goal is to wipe out all pages from zswap. It
> prefers swapout performance only.
>
> I think the current LRU logic decreases nr_zswap_protected too quickly
> for my workload. In zswap_lru_add(), nr_zswap_protected is reduced to
> between 1/4 and 1/8 of the LRU size. Although zswap_folio_swapin()
> increments nr_zswap_protected when page-ins of evicted pages occur
> later, this technically has no effect while reclaim is in progress.
>
> While zswap_store() and zswap_lru_add() are called, the dynamic
> shrinker is likely running due to the pressure. The dynamic shrinker
> reduces the LRU size to 1/4, and then a few subsequent zswap_store()
> calls reduce the protected count to 1/4 of the LRU size. The stored
> pages will be reduced to zero through a few shrinker_scan calls.

Ah, this is a fair point. We've been observing this in
production/experiments as well - there seems to be a positive
correlation between the zswpout rate and the zswap_written_back rate.
Whenever there's a spike in zswpout, you also see a spike in
written-back pages - it looks like the flood of zswpouts weakens
zswap's LRU protection, which is not quite the intended effect.

We're working to improve this situation. We have a couple of ideas
floating around, none of which are too complicated to implement, but
they need experiments to validate before sending upstream :)
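
For reference, the decay we are both describing is roughly this (a
simplified userspace model of what I believe zswap_lru_add() does
around 6.10, not a verbatim copy of mm/zswap.c):

/*
 * Simplified model of the protection decay: each store grows the LRU
 * and, once the protected count exceeds lru_size / 4, halves it, so a
 * burst of zswpouts quickly erodes whatever protection
 * zswap_folio_swapin() has built up.
 */
#include <stdio.h>

int main(void)
{
	long lru_size = 1000;
	long nr_protected = 800;	/* built up by earlier swapins */

	for (int store = 0; store < 8; store++) {
		lru_size++;			/* new entry from zswap_store() */
		if (nr_protected > lru_size / 4)
			nr_protected /= 2;	/* decay applied on LRU add */
		printf("store %d: lru=%ld protected=%ld\n",
		       store, lru_size, nr_protected);
	}
	/* the protected count ends up between lru_size/8 and lru_size/4 */
	return 0;
}

A short burst of stores pins the protected count to roughly a quarter
of the LRU size or less, which matches what you are seeing.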