On Tue, Jan 2, 2024 at 3:39 AM Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx> wrote: > > On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote: > > > > On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He > > <hezhongkun.hzk@xxxxxxxxxxxxx> wrote: > > > > > > > My apologies for the delayed response. I have a couple of questions. > > > > > The zswap_writeback_entry() will add a page to the swap cache, decompress > > > the entry data into the page, and issue a bio write to write the page back > > > to the swap device. Move the page to the tail of lru list through > > > SetPageReclaim(page) and folio_rotate_reclaimable(). > > > > > > Currently, about half of the pages will fail to move to the tail of lru > > > > May I ask what's the downstream effect of this? i.e so what if it > > fails? And yes, as Andrew pointed out, it'd be nice if the patch > > changelog spells out any observable or measurable change from the > > user's POV. > > > > The swap cache page used to decompress zswap_entry should be > moved to the tail of the inactive list after end_writeback, We can release > them in time.Just like the following code in zswap_writeback_entry(). > > /* move it to the tail of the inactive list after end_writeback */ > SetPageReclaim(page); > > After the writeback is over, the function of > folio_rotate_reclaimable() will fail > because the page is not in the LRU list but in some of the cpu folio batches. > Therefore we did not achieve the goal of setting SetPageReclaim(page), and > the pages could not be free in time. > > > > list because there is no LRU flag in page which is not in the LRU list but > > > the cpu_fbatches. So fix it. > > > > This sentence is a bit confusing to me. Does this mean the page > > currently being processed for writeback is not in the LRU list > > (!PageLRU(page)), but IN one of the cpu folio batches? Which makes > > folio_rotate_reclaimable() fails on this page later on in the > > _swap_writepage() path? (hence the necessity of lru_add_drain()?) > > > > Yes, exactly. > > > Let me know if I'm misunderstanding the intention of this patch. I > > know it's a bit pedantic, but spelling things out (ideally in the > > changelog itself) will help the reviewers, as well as future > > contributors who want to study the codebase and make changes to it. > > > > Sorry,my bad. > > > > > > > Signed-off-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx> > > > > Thanks and look forward to your response, > > Nhat > > > > P/S: Have a nice holiday season and happy new year! > > Here are the steps and results of the performance test: > 1:zswap+ zram (simplified model with on IO) > 2:disabel zswap/parameters/same_filled_pages_enabled (stress have same pages) > 3:stress --vm 1 --vm-bytes 2g --vm-hang 0 (2Gi anon pages) > 4: In order to quickly release zswap_entry, I used the previous > patch (I will send it again later). > https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@xxxxxxxxxxxxx/ > > Performance result: > reclaim 1Gi zswap_entry > > time echo 1 > writeback_time_threshold > (will release the zswap_entry, not been accessed for more than 1 seconds ) > > Base With this patch > real 0m1.015s real 0m1.043s > user 0m0.000s user 0m0.001s > sys 0m1.013s sys 0m1.040s > So no obvious performance regression was found. That's around 2.7% increase in real time, no? Admittedly, this micro-benchmark is too small to conclude either way, but the data doesn't seem to be in your favor. I'm a bit concerned about the overhead here, given that with this patch we will drain the per-cpu batch on every written-back entry. That's quite a high frequency, especially since we're moving towards more writeback (either with the new zswap shrinker, or your time threshold-based writeback mechanism). For instance, there seems to be some (local/per-cpu) locking involved, no? Could there be some form of lock contentions there (especially since with the new shrinker, you can have a lot of concurrent writeback contexts?) Furthermore, note that a writeback from zswap to swap saves less memory than a writeback from memory to swap, so the effect of the extra overhead will be even more pronounced. That is, the amount of extra work done (from this change) to save one unit of memory would be even larger than if we call lru_add_drain() every time we swap out a page (from memory -> swap). And I'm pretty sure we don't call lru_add_drain() every time we swap out a page - I believe we call lru_add_drain() every time we perform a shrink action. For e.g, in shrink_inactive_list(). That's much coarser in granularity than here. Also, IIUC, the more often we perform lru_add_drain(), the less batching effect we will obtain. IOW, the overhead of maintaining the batch will become higher relative to the performance gains from batching. Maybe I'm missing something - could you walk me through how lru_add_drain() is fine here, from this POV? Thanks! > > After writeback, we perform the following steps to release the memory again > echo 1g > memory.reclaim > > Base: > total used recalim total used > Mem: 38Gi 2.5Gi ----> 38Gi 1.5Gi > Swap: 5.0Gi 1.0Gi ----> 5Gi 1.5Gi > used memory -1G swap +0.5g > It means that half of the pages are failed to move to the tail of lru list, > So we need to release an additional 0.5Gi anon pages to swap space. > > With this patch: > total used recalim total used > Mem: 38Gi 2.6Gi ----> 38Gi 1.6Gi > Swap: 5.0Gi 1.0Gi ----> 5Gi 1Gi > > used memory -1Gi, swap +0Gi > It means that we release all the pages which have been add to the tail of > lru list in zswap_writeback_entry() and folio_rotate_reclaimable(). > OTOH, this suggests that we're onto something. Swap usage seems to decrease quite a bit. Sounds like a real problem that this patch is tackling. (Please add this benchmark result to future changelog. It'll help demonstrate the problem). I'm inclined to ack this patch, but it'd be nice if you can assuage my concerns above (with some justification and/or larger benchmark). (Or perhaps, we have to drain, but less frequently/higher up the stack?) Thanks, Nhat > > Thanks for your time Nhat and Andrew. Happy New Year!