On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fails to store.
> > >
> > > This patch series adds two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fails, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery; perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> I agree it should also stop writeback due to the limit.
>
> Note that a knob like this is still useful even when zswap space is
> decoupled from disk swap slots. We are still using disk swap broadly
> in the fleet as well, so a static ghost file setup wouldn't be a good
> solution for us. Swapoff with common swapfile sizes is often not an
> option during runtime, due to how slow it is, and the destabilizing
> effect it can have on the system unless it's basically completely
> idle. As such, we expect to continue deploying swap files for physical
> use, and switch the zswap-is-terminal knob depending on the workload.
>
> The other aspect to this is that workloads that do not want the
> swapout/swapin overhead for themselves are usually co-located with
> system management software, and/or can share the host with less
> latency-sensitive workloads, which should continue to use disk swap.
>
> Due to the latter case, I wonder if a global knob is actually
> enough. More likely we'd need per-cgroup control over this.
>
> [ It's at this point where the historic coupling of zswap to disk
> space is especially unfortunate. Because of it, zswap usage counts
> toward the memory.swap allowance. If these were separate, we could
> have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
> the desired effect.
>
> Alas, that ship has sailed. Even after a decoupling down the line it
> would be difficult to change established memory.swap semantics. ]
>
> So I obviously agree that we still need to invest in decoupling zswap
> space from physical disk slots. It's insanely wasteful, especially
> with larger memory capacities. But while it would be a fantastic
> optimization, I don't see how it would be an automatic solution to the
> problem that inspired this proposal.
>
> We still need some way to reasonably express desired workload policy
> in a setup that supports multiple, simultaneous modes of operation.
>
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
>
> Yeah, this is a concern. It's not such a big deal if it's only a few
> pages, and we're still shrinking the footprint on aggregate. But it's
> conceivable this can happen systematically with some datasets, in
> which case reclaim will drive up the memory consumption and cause
> OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
> memory deadlocks.
>
> This isn't something we can reasonably accept as worst-case behavior.
>
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> That should work. Once zswap has exclusive control over the page, it
> is free to muck with its lru linkage. A lower bit tag on the next or
> prev pointer should suffice to distinguish between struct page and
> struct zswap_entry when pulling stuff from the list.

Hmm, I'm a bit iffy about pointer bit hacking, but I guess that's the
least involved way to store the uncompressed page in the zswap LRU
without allocating a zswap_entry for it.

Lemme give this a try. If it looks decently clean I'll send it out :)

> We'd also have to teach vmscan.c to hand off the page. It currently
> expects that it either frees the page back to the allocator, or puts
> it back on the LRU. We'd need a compromise where it continues to tear
> down the page and remove the mapping, but then leaves it to zswap.
>
> Neither of those sounds impossible. But since it's a bigger
> complication than this proposal, it probably needs a new cost/benefit
> analysis, with potentially more data on the problem of LRU inversions.
>
> Thanks for your insightful feedback, Yosry.
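
For context, a minimal standalone userspace C sketch of the lower-bit
tagging trick being discussed is below. All names in it (fake_page,
fake_zswap_entry, lru_pack(), lru_tag(), lru_ptr(), LRU_TAG_*) are
made-up stand-ins for illustration only; this is not the kernel's
struct page, struct zswap_entry, or list_head code, just the
pointer-tagging idea shown in isolation:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 1-bit type tag stored in the (always-zero) low bit of an aligned pointer. */
#define LRU_TAG_ZSWAP   ((uintptr_t)0)  /* entry is a zswap_entry */
#define LRU_TAG_PAGE    ((uintptr_t)1)  /* entry is an uncompressed page */

/* Made-up stand-ins for struct zswap_entry and struct page. */
struct fake_zswap_entry { int id; };
struct fake_page        { int id; };

/* Pack an object pointer and its type tag into one word. */
static inline uintptr_t lru_pack(void *obj, uintptr_t tag)
{
        assert(((uintptr_t)obj & 1) == 0);      /* need bit 0 free */
        return (uintptr_t)obj | tag;
}

static inline uintptr_t lru_tag(uintptr_t packed)
{
        return packed & 1;
}

static inline void *lru_ptr(uintptr_t packed)
{
        return (void *)(packed & ~(uintptr_t)1);
}

int main(void)
{
        struct fake_page p = { .id = 1 };
        struct fake_zswap_entry z = { .id = 2 };

        /* A toy "LRU": both object types mixed in the same list of words. */
        uintptr_t lru[] = {
                lru_pack(&p, LRU_TAG_PAGE),
                lru_pack(&z, LRU_TAG_ZSWAP),
        };

        for (int i = 0; i < 2; i++) {
                if (lru_tag(lru[i]) == LRU_TAG_PAGE)
                        printf("slot %d: page id=%d\n", i,
                               ((struct fake_page *)lru_ptr(lru[i]))->id);
                else
                        printf("slot %d: zswap_entry id=%d\n", i,
                               ((struct fake_zswap_entry *)lru_ptr(lru[i]))->id);
        }
        return 0;
}

A real patch would presumably hide the tagging behind helpers on the
zswap LRU's list linkage, as Johannes describes, rather than on a raw
array; the sketch only shows how one spare alignment bit is enough to
tell the two object types apart when pulling entries off the list.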