On Tue, Sep 26, 2023 at 11:37:17AM -0700, Yosry Ahmed wrote:
> On Tue, Sep 26, 2023 at 11:24 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Mon, Sep 25, 2023 at 01:17:04PM -0700, Yosry Ahmed wrote:
> > > +Chris Li
> > >
> > > On Tue, Sep 19, 2023 at 10:14 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > > >
> > > > From: Domenico Cerasuolo <cerasuolodomenico@xxxxxxxxx>
> > > >
> > > > Currently, we only have a single global LRU for zswap. This makes it
> > > > impossible to perform workload-specific shrinking - a memcg cannot
> > > > determine which pages in the pool it owns, and often ends up writing
> > > > pages from other memcgs. This issue has been previously observed in
> > > > practice and mitigated by simply disabling memcg-initiated shrinking:
> > > >
> > > > https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@xxxxxxxxx/T/#u
> > > >
> > > > This patch fully resolves the issue by replacing the global zswap LRU
> > > > with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
> > > >
> > > > a) When a store attempt hits a memcg limit, it now triggers a
> > > >    synchronous reclaim attempt that, if successful, allows the new
> > > >    hotter page to be accepted by zswap.
> > > > b) If the store attempt instead hits the global zswap limit, it will
> > > >    trigger an asynchronous reclaim attempt, in which a memcg is
> > > >    selected for reclaim in a round-robin-like fashion.
> > >
> > > Hey Nhat,
> > >
> > > I didn't take a very close look as I am currently swamped, but going
> > > through the patch I have some comments/questions below.
> > >
> > > I am not very familiar with list_lru, but it seems like the existing
> > > API derives the node and memcg from the list item itself. Seems like
> > > we can avoid a lot of changes if we allocate struct zswap_entry from
> > > the same node as the page, and account it to the same memcg. Would
> > > this be too much of a change or too strong of a restriction? It's a
> > > slab allocation and we will free memory on that node/memcg right
> > > after.
> >
> > My 2c, but I kind of hate that assumption made by list_lru.
> >
> > We ran into problems with it with the THP shrinker as well. That one
> > strings up 'struct page', and virt_to_page(page) results in really fun
> > to debug issues.
> >
> > IMO it would be less error prone to have memcg and nid as part of the
> > regular list_lru_add() function signature. And then have an explicit
> > list_lru_add_obj() that does a documented memcg lookup.
>
> I also didn't like/understand that assumption, but again I don't have
> enough familiarity with the code to judge, and I don't know why it was
> done that way. Adding memcg and nid as arguments to the standard
> list_lru API makes the pill easier to swallow. In any case, this
> should be done in a separate patch to make the diff here more focused
> on zswap changes.

Just like the shrinker, it was initially written for slab objects like
dentries and inodes. Since all the users then passed in charged slab
objects, it ended up hardcoding that assumption in the interface.

Once that assumption no longer holds, IMO we should update the
interface. But agreed, a separate patch makes sense.
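
To make that concrete, here is a rough sketch of the split against
include/linux/list_lru.h - names and signatures are illustrative only,
nothing like this exists in the tree today:

        /* callers pass the node and memcg explicitly */
        bool list_lru_add(struct list_lru *lru, struct list_head *item,
                          int nid, struct mem_cgroup *memcg);

        /*
         * Convenience wrapper that keeps today's behavior: derive both
         * from the object, which must be memcg-charged slab memory.
         */
        static inline bool list_lru_add_obj(struct list_lru *lru,
                                            struct list_head *item)
        {
                int nid = page_to_nid(virt_to_page(item));
                struct mem_cgroup *memcg = mem_cgroup_from_slab_obj(item);

                return list_lru_add(lru, item, nid, memcg);
        }

Slab users like dentries and inodes keep a one-line call, and zswap
could pass the nid and memcg of the page being compressed directly.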

> > Because of the overhead, we've been selective about the memory we
> > charge. I'd hesitate to do it just to work around list_lru.
>
> On the other hand I am worried about the continuous growth of struct
> zswap_entry. It's now at ~10 words on 64-bit? That's ~2% of the size
> of the page getting compressed if I am not mistaken. So I am skeptical
> about storing the nid there.
>
> A middle ground would be allocating struct zswap_entry on the correct
> node without charging it. We don't need to store the nid and we don't
> need to charge struct zswap_entry. It doesn't get rid of
> virt_to_page() though.

I like that idea. The current list_lru_add() would do the
virt_to_page() too, so from that POV it's a lateral move. But we'd
save the charging and the extra nid member.
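
Something like this on the store side, I imagine (untested sketch;
zswap_entry_cache_alloc_node() is made up here, it would be a new
variant of zswap_entry_cache_alloc()):

        /* node-local allocation, deliberately without __GFP_ACCOUNT */
        static struct zswap_entry *zswap_entry_cache_alloc_node(gfp_t gfp,
                                                                 int nid)
        {
                return kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
        }

with the store path doing

        entry = zswap_entry_cache_alloc_node(GFP_KERNEL,
                                             page_to_nid(page));

so the entry lands on the same node as the page it describes, without
growing the struct with a nid member and without charging it.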