On Wed, Jan 3, 2024 at 6:12 AM Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx> wrote:
>
> > That's around 2.7% increase in real time, no? Admittedly, this
> > micro-benchmark is too small to conclude either way, but the data
> > doesn't seem to be in your favor.
> >
> > I'm a bit concerned about the overhead here, given that with this
> > patch we will drain the per-cpu batch on every written-back entry.
> > That's quite a high frequency, especially since we're moving towards
> > more writeback (either with the new zswap shrinker, or your time
> > threshold-based writeback mechanism). For instance, there seems to be
> > some (local/per-cpu) locking involved, no? Could there be some form of
> > lock contention there (especially since with the new shrinker, you
> > can have a lot of concurrent writeback contexts)?
> >
> > Furthermore, note that a writeback from zswap to swap saves less
> > memory than a writeback from memory to swap, so the effect of the
> > extra overhead will be even more pronounced. That is, the amount of
> > extra work done (from this change) to save one unit of memory would be
> > even larger than if we call lru_add_drain() every time we swap out a
> > page (from memory -> swap). And I'm pretty sure we don't call
> > lru_add_drain() every time we swap out a page - I believe we call
> > lru_add_drain() every time we perform a shrink action, e.g. in
> > shrink_inactive_list(). That's much coarser in granularity than here.
> >
> > Also, IIUC, the more often we perform lru_add_drain(), the less
> > batching effect we will obtain. IOW, the overhead of maintaining the
> > batch will become higher relative to the performance gains from
> > batching.
> >
> > Maybe I'm missing something - could you walk me through how
> > lru_add_drain() is fine here, from this POV? Thanks!
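
[Editor's note: the granularity contrast in the quoted concern above can be sketched roughly as follows. This is kernel-style pseudocode, not compilable outside a kernel tree and not code from the patch under discussion; `more_entries_to_write_back()` and `writeback_one_entry_no_drain()` are hypothetical helpers standing in for the writeback loop.]

```c
/*
 * Sketch only: contrasts draining the per-CPU LRU batch on every
 * written-back entry with draining once per shrink pass, which is
 * closer to what shrink_inactive_list() effectively does.
 */

/* Per-entry draining -- the overhead concern raised above: */
static int writeback_entry_sketch(struct zswap_entry *entry)
{
	/* ... decompress, submit the page for swap writeback ... */
	lru_add_drain();	/* runs once per written-back entry */
	return 0;
}

/* Per-pass draining -- coarser granularity: */
static void shrink_pass_sketch(void)
{
	lru_add_drain();	/* drain once, before the whole pass */
	while (more_entries_to_write_back())	/* hypothetical predicate */
		writeback_one_entry_no_drain();	/* hypothetical helper */
}
```

The difference is purely call frequency: with N written-back entries, the first shape pays the drain (and its per-CPU locking) N times, the second once per pass.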
> >
> > > After writeback, we perform the following steps to release the memory again:
> > >   echo 1g > memory.reclaim
> > >
> > > Base:
> > >            total    used             reclaim           total    used
> > > Mem:        38Gi   2.5Gi    ---->     38Gi   1.5Gi
> > > Swap:      5.0Gi   1.0Gi    ---->      5Gi   1.5Gi
> > > used memory -1Gi, swap +0.5Gi
> > > This means that half of the pages failed to move to the tail of the LRU
> > > list, so we need to release an additional 0.5Gi of anon pages to swap space.
> > >
> > > With this patch:
> > >            total    used             reclaim           total    used
> > > Mem:        38Gi   2.6Gi    ---->     38Gi   1.6Gi
> > > Swap:      5.0Gi   1.0Gi    ---->      5Gi     1Gi
> > >
> > > used memory -1Gi, swap +0Gi
> > > This means that we release all the pages which were added to the tail of
> > > the LRU list in zswap_writeback_entry() and folio_rotate_reclaimable().
> > >
> >
> > OTOH, this suggests that we're onto something. Swap usage seems to
> > decrease quite a bit. Sounds like a real problem that this patch is
> > tackling.
> > (Please add this benchmark result to future changelogs. It'll help
> > demonstrate the problem.)
>
> Yes.
>
> > I'm inclined to ack this patch, but it'd be nice if you can assuage my
> > concerns above (with some justification and/or a larger benchmark).
>
> OK, thanks.
>
> > (Or perhaps, we have to drain, but less frequently/higher up the stack?)
>
> I've reviewed the code again and have no idea. It would be better if
> you have any suggestions.

Hmm, originally I was thinking of doing an (unconditional) lru_add_drain()
outside of zswap_writeback_entry() - once in shrink_worker() and/or
zswap_shrinker_scan(), before we write back any of the entries. Not sure if
it would work/help here though - I haven't tested that idea yet.

> New test:
> This patch adds the execution of folio_rotate_reclaimable() (not executed
> without this patch) and lru_add_drain(), including per-CPU lock competition.
> I bind a new task to allocate memory and use the same batch lock to compete
> with the target process, on the same CPU.
> Context:
> 1: stress --vm 1 --vm-bytes 1g (bound to cpu0)
> 2: stress --vm 1 --vm-bytes 5g --vm-hang 0 (bound to cpu0)
> 3: reclaim pages, and write back 5G of zswap entries on cpu0 and node 0.
>
> Average time of five tests:
>
>     Base      patch     patch + compete
>     4.947     5.0676    5.1336
>               +2.4%     +3.7%
>
> "compete" means: a new stress run on cpu0 to compete with the writeback
> process.
>
>  PID  USER  %CPU %MEM  TIME+    COMMAND                   P
> 1367  root  49.5  0.0  1:09.17  bash (writeback)          0
> 1737  root  49.5  2.2  0:27.46  stress (use percpu lock)  0
>
> Around 2.4% increase in real time, including the execution of
> folio_rotate_reclaimable() (not executed without this patch) and
> lru_add_drain(), but no lock contention.

Hmm, looks like the regression is still there, no?

> Around 1.3% additional increase in real time with lock contention on the
> same CPU.
>
> There is another option here: do not move the page to the tail of the
> inactive list after end_writeback, and delete the following code in
> zswap_writeback_entry(), which did not work properly. But then the pages
> will not be released first.
>
> /* move it to the tail of the inactive list after end_writeback */
> SetPageReclaim(page);

Or only SetPageReclaim() on pages that are on the LRU?

> Thanks,
> Zhongkun
>
> > Thanks,
> > Nhat
>
> Thanks for your time Nhat and Andrew. Happy New Year!
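
[Editor's note: the "only SetPageReclaim on pages on the LRU" suggestion near the end of the thread could look roughly like the following untested, kernel-style sketch; it is not a patch from this thread, and the placement inside zswap_writeback_entry() is an assumption.]

```c
/*
 * Sketch only, untested: request rotation-on-writeback-completion
 * only when the page is actually on an LRU list, since
 * folio_rotate_reclaimable() can only rotate pages that sit on an
 * LRU. Assumed to live in zswap_writeback_entry() before the I/O
 * is submitted.
 */
if (PageLRU(page)) {
	/* move it to the tail of the inactive list after end_writeback */
	SetPageReclaim(page);
}
```

Whether this avoids the "pages will not be released first" problem mentioned above would still need to be measured.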