On Tue, Jul 9, 2024 at 9:53, Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> > post-patch, 6.10-rc4 with patch 1 to 5
>
> You mean 1 to 6? There are 6 patches, no?

Oops. With patches 1 to 6.

>
> Just out of pure curiosity, could you include the stats from patch 1-3 only?
>

I will rerun the bench in v3. I assume this bench does not reflect
patches 4 to 6, since a delta of pool_limit_hit=0 means zswap rejected
no pages.

> Ah this is interesting. Did you actually see improvement in your real
> deployment (i.e not the benchmark) with patch 4-6 in?
>

As I replied on patch 6, with memory-consuming tasks such as
`apt upgrade`, for instance.

> >
> > Intended scenario for memory reclaim:
> > 1. zswap pool < accept_threshold as the initial state. This is achieved
> > by patch 3, proactive shrinking.
> > 2. Active processes start allocating pages. Pageout is buffered by zswap
> > without IO.
> > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> > incoming pages and starts writeback immediately in the background.
> > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> > starts rejecting pages. Write IO for the rejected page will consume
> > all IO resources.
>
> This sounds like the proactive shrinker is still not aggressive
> enough, and/or there are some sort of misspecifications of the zswap
> setting... Correct me if I'm wrong, but the new proactive global
> shrinker begins 1% after the acceptance threshold, and shrinks down to
> acceptance threshold, right? How are we still hitting the pool
> limit...
>

Proactive shrinking should not be aggressive. With patches 4 and 6, I
modified the global shrinker to be less aggressive against pagein/out.
Proactive shrinking cannot avoid hitting the pool limit when memory
pressure grows faster than writeback can keep up.

> My concern is that we are knowingly (and perhaps unnecessarily)
> creating an LRU inversion here - preferring swapping out the rejected
> pages over the colder pages in the zswap pool. Shouldn't it be the
> other way around?
> For instance, can we spiral into the following
> scenario:
>
> 1. zswap pool becomes full.
> 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> keeps rejecting incoming pages, and putting a hold on the global
> shrinker.
> 3. The pages that are swapped out are warmer than the ones stored in
> the zswap pool, so they will be more likely to be swapped in (which,
> IIUC, will also further delay the global shrinker).
>
> and the cycle keeps going on and on?

I agree this does not follow LRU, but I think the LRU priority
inversion is unavoidable once the pool limit is hit. If the inversion
matters, accept_thr_percent should be lowered to reduce its
probability. (That is why I implemented the proactive shrinker.)

When writeback throughput is slower than the growth in memory usage,
zswap_store() will have to reject pages sooner or later. If we evict
the oldest stored pages synchronously before rejecting a new page
(rotating the pool to keep LRU order), latency will suffer depending
on how much writeback is required to store the new page. If the oldest
pages compressed well, we would have to evict too many of them to
store one warmer page, which blocks reclaim progress. Fragmentation in
the zspool may also increase the amount of writeback required. We
cannot both maintain LRU priority and maintain pageout latency.

Additionally, zswap_writeback_entry() is slower than direct pageout. I
assume this is because the shrinker performs 4KB IO synchronously. I
am seeing that shrinking throughput is limited to disk IOPS * 4KB,
while much higher throughput can be achieved by disabling zswap.
Direct pageout can be faster than zswap writeback, possibly because of
bio optimization or sequential allocation of swap.

> Have you experimented with synchronous reclaim in the case the pool is
> full? All the way to the acceptance threshold is too aggressive of
> course - you might need to find something in between :)
>

I don't get what the expected situation is.
The benchmark for patch 6 performs synchronous reclaim when the pool
is full, since bulk memory allocation (writes to an mmapped region) is
much faster than writeback throughput. The zswap pool fills almost
instantly at the beginning of each benchmark run. I think
accept_thr_percent is not significant for the benchmark.

>
> I wonder if this contention would show up in PSI metrics
> (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> with IO pressure to show the pattern, i.e the contention problem was
> there before, and is now resolved? :)

Unfortunately, I could not find a reliable metric other than elapsed
time. It seems PSI does not distinguish stalls for rejected pageout
from stalls for shrinker writeback. As for the counters, this issue
affects latency but does not increase the number of pageins/pageouts.
Is there any better way to observe the origin of the contention?

Thanks.
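
For reference, the correlation suggested in the quoted text could be
sampled with something like the sketch below. It is a minimal
illustration, not what was used for the numbers in this thread: the
helper names (parse_psi, parse_vmstat, delta, sample) are hypothetical,
the counter list is just the subset named above, and which counters
exist in /proc/vmstat depends on the kernel version. It only shows IO
stall time next to swap counters; it does not separate stalls for
rejected pageout from stalls for shrinker writeback.

```python
# Sketch only: diff PSI IO stall time and swap counters between two samples.
from pathlib import Path

# Counter names taken from the discussion above (assumption: present on
# the running kernel; missing ones are silently skipped).
COUNTERS = ("pswpout", "zswpout", "zswpwb")


def parse_psi(text):
    """Parse /proc/pressure/* text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def parse_vmstat(text, names=COUNTERS):
    """Pick the requested counters out of /proc/vmstat text."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            values[parts[0]] = int(parts[1])
    return values


def delta(before, after):
    """Per-key difference between two samples (keys present in both)."""
    return {k: after[k] - before[k] for k in after if k in before}


def sample():
    """Read one sample from the running system (Linux with PSI enabled)."""
    psi = parse_psi(Path("/proc/pressure/io").read_text())
    vm = parse_vmstat(Path("/proc/vmstat").read_text())
    # PSI "total" is cumulative stall time in microseconds.
    vm["io_full_stall_us"] = int(psi["full"]["total"])
    return vm
```

Calling sample() before and after a benchmark run and passing both
results to delta() gives the swap counter deltas alongside the
accumulated IO stall time for that run.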