Re: [PATCH v6 4/6] zsmalloc: Add a LRU to zs_pool to keep track of zspages in LRU order

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Wed, 23 Nov 2022 00:02:39 -0800

On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
<senozhatsky@xxxxxxxxxxxx> wrote:
>
> On (22/11/22 12:42), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/18 16:15), Nhat Pham wrote:
> > > [..]
> > > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > > >   obj_to_location(obj, &page, &obj_idx);
> > > >   zspage = get_zspage(page);
> > > >
> > > > +#ifdef CONFIG_ZPOOL
> > > > + /* Move the zspage to front of pool's LRU */
> > > > + if (mm == ZS_MM_WO) {
> > > > +         if (!list_empty(&zspage->lru))
> > > > +                 list_del(&zspage->lru);
> > > > +         list_add(&zspage->lru, &pool->lru);
> > > > + }
> > > > +#endif
> > >
> > > Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> > > I wonder why, we use them, so technically they are not exactly
> > > "least recently used".
> >
> > This is a swap LRU. Per definition there are no ongoing accesses to
> > the memory while the page is swapped out that would make it "hot".
>
> Hmm. Not arguing, just trying to understand some things.
>
> There are no accesses to swapped out pages yes, but zspage holds multiple
> objects, which are compressed swapped out pages in this particular case.
> For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> read 20 objects from that zspage, IOW zspage has been in use 20 times very
> recently, while writeback still considers it to be "not-used" and will
> evict it.
>
> So if this works for you then I'm fine. But we probably, like you suggested,
> can document a couple of things here - namely why WRITE access to zspage
> counts as "zspage is in use" but READ access to the same zspage does not
> count as "zspage is in use".
>

I guess the key here is that we have an LRU of zspages, when we really
want an LRU of compressed objects. In some cases, we may end up
reclaiming the wrong pages.

Assuming we have 2 zspages, Z1 and Z2, and 4 physical pages that we
compress over time, P1 -> P4.

Let's assume P1 -> P4 get compressed in order (P4 is the hottest
page), and they get assigned to zspages as follows:
Z1: P1, P3
Z2: P2, P4

In this case, the zspages LRU would be Z2->Z1, because Z2 was touched
last when we compressed P4. Now if we want to writeback, we will look
at Z1, and we might end up reclaiming P3, depending on the order the
pages are stored in.

A worst case scenario of this would be if we have a large number of
pages, maybe 1000, P1->P1000 (where P1000 is the hottest), and they
all go into Z1 and Z2 in this way:
Z1: P1 -> P499, P1000
Z2: P500 -> P999

In this case, Z1 contains 499 cold pages, but it got P1000 at the end
which caused us to put it on the front of the LRU. Now writeback will
consistently use Z2. This is bad. Now I have no idea how practical
this is, but it seems fairly random, based on the compression size of
pages and access patterns.

Does this mean we should move zspages to the front of the LRU when we
writeback from them? No, I wouldn't say so. The same exact scenario
can happen because of this. Imagine the following assignment of the
1000 pages:
Z1: P<odd> (P1, P3, .., P999)
Z2: P<even> (P2, P4, .., P1000)

Z2 is at the front of the LRU because it has P1000, so the first time
we do writeback we will start at Z1. Once we reclaim one object from
Z1, we will start writeback from Z2 next time, and we will keep
alternating. Now if we are really unlucky, we can end up reclaiming in
this order P999, P1000, P997, P998, ... . So yeah I don't think
putting zspages in the front of the LRU when we writeback is the
answer. I would even say it's completely orthogonal to the problem,
because writing back an object from the zspage at the end of the LRU
gives us 0 information about the state of other objects on the same
zspage.

Ideally, we would have an LRU of objects instead, but this would be
very complicated with the current form of writeback. It would be much
easier if we have an LRU for zswap entries instead, which is something
I am looking into, and is a much bigger surgery, and should be
separate from this work. Today zswap inverts LRU priorities anyway by
sending hot pages to the swapfile when zswap is full, when colder
pages are in zswap, so I wouldn't really worry about this now :)