On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> wrote: > > On (22/11/22 12:42), Johannes Weiner wrote: > > On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote: > > > On (22/11/18 16:15), Nhat Pham wrote: > > > [..] > > > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, > > > > obj_to_location(obj, &page, &obj_idx); > > > > zspage = get_zspage(page); > > > > > > > > +#ifdef CONFIG_ZPOOL > > > > + /* Move the zspage to front of pool's LRU */ > > > > + if (mm == ZS_MM_WO) { > > > > + if (!list_empty(&zspage->lru)) > > > > + list_del(&zspage->lru); > > > > + list_add(&zspage->lru, &pool->lru); > > > > + } > > > > +#endif > > > > > > Do we consider pages that were mapped for MM_RO/MM_RW as cold? > > > I wonder why, we use them, so technically they are not exactly > > > "least recently used". > > > > This is a swap LRU. Per definition there are no ongoing accesses to > > the memory while the page is swapped out that would make it "hot". > > Hmm. Not arguing, just trying to understand some things. > > There are no accesses to swapped out pages yes, but zspage holds multiple > objects, which are compressed swapped out pages in this particular case. > For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage, > that is 93 compressed swapped out pages. Consider ZS_FULL zspages which > is at the tail of the LRU list. Suppose that we page-faulted 20 times and > read 20 objects from that zspage, IOW zspage has been in use 20 times very > recently, while writeback still considers it to be "not-used" and will > evict it. > > So if this works for you then I'm fine. But we probably, like you suggested, > can document a couple of things here - namely why WRITE access to zspage > counts as "zspage is in use" but READ access to the same zspage does not > count as "zspage is in use". > I guess the key here is that we have an LRU of zspages, when we really want an LRU of compressed objects. In some cases, we may end up reclaiming the wrong pages. Assuming we have 2 zspages, Z1 and Z2, and 4 physical pages that we compress over time, P1 -> P4. Let's assume P1 -> P4 get compressed in order (P4 is the hottest page), and they get assigned to zspages as follows: Z1: P1, P3 Z2: P2, P4 In this case, the zspages LRU would be Z2->Z1, because Z2 was touched last when we compressed P4. Now if we want to writeback, we will look at Z1, and we might end up reclaiming P3, depending on the order the pages are stored in. A worst case scenario of this would be if we have a large number of pages, maybe 1000, P1->P1000 (where P1000 is the hottest), and they all go into Z1 and Z2 in this way: Z1: P1 -> P499, P1000 Z2: P500 -> P999 In this case, Z1 contains 499 cold pages, but it got P1000 at the end which caused us to put it on the front of the LRU. Now writeback will consistently use Z2. This is bad. Now I have no idea how practical this is, but it seems fairly random, based on the compression size of pages and access patterns. Does this mean we should move zspages to the front of the LRU when we writeback from them? No, I wouldn't say so. The same exact scenario can happen because of this. Imagine the following assignment of the 1000 pages: Z1: P<odd> (P1, P3, .., P999) Z2: P<even> (P2, P4, .., P1000) Z2 is at the front of the LRU because it has P1000, so the first time we do writeback we will start at Z1. Once we reclaim one object from Z1, we will start writeback from Z2 next time, and we will keep alternating. Now if we are really unlucky, we can end up reclaiming in this order P999, P1000, P997, P998, ... . So yeah I don't think putting zspages in the front of the LRU when we writeback is the answer. I would even say it's completely orthogonal to the problem, because writing back an object from the zspage at the end of the LRU gives us 0 information about the state of other objects on the same zspage. Ideally, we would have an LRU of objects instead, but this would be very complicated with the current form of writeback. It would be much easier if we have an LRU for zswap entries instead, which is something I am looking into, and is a much bigger surgery, and should be separate from this work. Today zswap inverts LRU priorities anyway by sending hot pages to the swapfile when zswap is full, when colder pages are in zswap, so I wouldn't really worry about this now :)