On Mon, Feb 27, 2023 at 8:29 PM Kalesh Singh <kaleshsingh@xxxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > thanks for proposing this, Yosry. I'm very interested in this work. Unfortunately, I won't be able to attend LSFMMBPF myself this time around due to a scheduling conflict :(
> >
> > Ugh, would have been great to have you, I guess there might be a remote option, or we will end up discussing on the mailing list eventually anyway.
> > >
> > > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Hi Yosry,
> > > > > > > > >
> > > > > > > > > Thanks for proposing this topic. I was thinking about this before but I didn't make too much progress due to some other distractions, and I got a couple of follow up questions about your design. Please see the inline comments below.
> > > > > > > >
> > > > > > > > Great to see interested folks, thanks!
> > > > > > > > >
> > > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello everyone,
> > > > > > > > > >
> > > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May 2023 about swap & zswap (hope I am not too late).
> > > > > > > > > >
> > > > > > > > > > ==================== Intro ====================
> > > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary way. To use zswap, you need a swapfile configured (even if the space will not be used) and zswap is restricted by its size. When pages reside in zswap, the corresponding swap entry in the swapfile cannot be used, and is essentially wasted. We also go through unnecessary code paths when using zswap, such as finding and allocating a swap entry on the swapout path, or readahead in the swapin path. I am proposing a swapping abstraction layer that would allow us to remove zswap's dependency on swapfiles. This can be done by introducing a data structure between the actual swapping implementation (swapfiles, zswap) and the rest of the MM code.
> > > > > > > > > >
> > > > > > > > > > ==================== Objective ====================
> > > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes zswap useful for a wider variety of use cases. Also, when zswap is used with a swapfile, the pages in zswap do not use up space in the swapfile, so the overall swapping capacity increases.
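(For folks just joining the thread: to make the "Idea" section quoted right below a little more concrete, this is very roughly the shape of the structure I keep referring to. Nothing here exists in any posted patch; every field and name is tentative and only meant as an illustration of the description that follows.)

struct swap_desc {
	atomic_t refcnt;		/* swap count (replaces swap_map + continuation) */
	struct folio *swapcache;	/* the swapcache becomes a simple pointer */
	unsigned long flags;		/* e.g. which backend currently holds the page */
	union {
		swp_entry_t slot;		/* backed by a swapfile slot */
		struct zswap_entry *zswap;	/* backed by zswap only */
	};
};

/* Flag bits in swap_desc.flags (again, purely illustrative). */
#define SWAP_DESC_ZSWAP		0	/* page currently lives in zswap */

Packing the flag into the low bits of one of the pointers is probably what would get this down toward the ~24 byte per-page estimate in the Cost section further down.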
> > > > > > > > > >
> > > > > > > > > > ==================== Idea ====================
> > > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as an abstraction layer between the swapping implementation and the rest of MM code. Page tables & page caches would store a swap id (encoded as a swp_entry_t) instead of directly storing the swap entry associated with the swapfile. This swap id maps to a struct swap_desc, which acts as our abstraction layer. All MM code not concerned with swapping details would operate in terms of swap descs. The swap_desc can point to either a normal swap entry (associated with a swapfile) or a zswap entry. It can also include all non-backend specific operations, such as the swapcache (which would be a simple pointer in swap_desc), swap counting, etc. It creates a clear, nice abstraction layer between MM code and the actual swapping implementation.
> > > > > > > > >
> > > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is it 1:1 mapped to the swap slots on swap devices (whatever backs it, for example, zswap, swap partition, swapfile, etc.)?
> > > > > > > >
> > > > > > > > I imagine swap_desc's would be dynamically allocated when we need to swap something out. When allocated, a swap_desc would either point to a zswap_entry (if available), or a swap slot otherwise. In this case, it would be 1:1 mapped to swapped out pages, not the swap slots on devices.
> > > > > > >
> > > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile is used as the backing store of zswap.
> > > > > > > >
> > > > > > > > I know that it might not be ideal to make allocations on the reclaim path (although it would be a small-ish slab allocation so we might be able to get away with it), but otherwise we would have statically allocated swap_desc's for all swap slots on a swap device, even unused ones, which I imagine is too expensive. Also for things like zswap, it doesn't really make sense to preallocate at all.
> > > > > > >
> > > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We do have such cases, but the fewer the better IMHO.
> > > > > >
> > > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the slab cache, idk if that makes sense, or if there is a way to tell slab to proactively refill a cache.
> > > > > >
> > > > > > I am open to suggestions here. I don't think we should/can preallocate the swap_desc's, and we cannot completely eliminate the allocations in the reclaim path. We can only try to minimize them through caching, etc. Right?
> > > > >
> > > > > Yeah, preallocation should not work. But I'm not sure whether caching works well for this case or not either. I suppose you were thinking about something similar to pcp: when the available number of elements is lower than a threshold, refill the cache. It should work well with moderate memory pressure.
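(Interjecting with a rough and completely untested sketch of the kind of caching I meant by "caching, etc." above, in case it helps ground the discussion. None of these names exist anywhere; the normal path would still be a plain slab allocation, and the small per-CPU reserve would be refilled by a worker outside of reclaim, so reclaim itself never has to produce these objects from scratch.)

static struct kmem_cache *swap_desc_cache;

#define SWAP_DESC_RESERVE_NR	16

struct swap_desc_reserve {
	struct swap_desc *descs[SWAP_DESC_RESERVE_NR];
	int nr;
};
static DEFINE_PER_CPU(struct swap_desc_reserve, swap_desc_reserve);

static struct swap_desc *swap_desc_alloc(void)
{
	struct swap_desc *desc;
	struct swap_desc_reserve *res;

	/* Reclaim runs with PF_MEMALLOC, so this should rarely fail. */
	desc = kmem_cache_zalloc(swap_desc_cache, GFP_NOWAIT | __GFP_NOWARN);
	if (desc)
		return desc;

	/* Last resort: dip into the pcp-style reserve refilled outside reclaim. */
	res = get_cpu_ptr(&swap_desc_reserve);
	if (res->nr)
		desc = res->descs[--res->nr];
	put_cpu_ptr(&swap_desc_reserve);

	return desc;
}

Whether something like this is actually worth it over calling into slab every time is exactly the severe-pressure question raised right below.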
> > > > > But I'm not sure how it would behave with severe memory pressure, particularly when anonymous memory dominates the memory usage. Or maybe dynamic allocation works well and we are just over-engineering this.
> > > >
> > > > Yeah it would be interesting to look into whether the swap_desc allocation will be a bottleneck. Definitely something to look out for. I share your thoughts about wanting to do something about it but also not wanting to over-engineer it.
> > >
> > > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning it's not subject to watermarks. And the swapped page is freed right afterwards. As long as the compression delta exceeds the size of swap_desc, the process is a net reduction in allocated memory. For regular swap, the only requirement is that swap_desc < page_size() :-)
> > >
> > > To put this into perspective, the zswap backends allocate backing pages on-demand during reclaim. zsmalloc also kmallocs metadata in that path. We haven't had any issues with this in production, even under fairly severe memory pressure scenarios.
> >
> > Right. The only problem would be for pages that do not compress well in zswap, in which case we might not end up freeing memory. As you said, this is already happening today with zswap tho.
> > > > > > > > > >
> > > > > > > > > > ==================== Benefits ====================
> > > > > > > > > > This work enables using zswap without a backing swapfile and increases the swap capacity when zswap is used with a swapfile. It also creates a separation that allows us to skip code paths that don't make sense in the zswap path (e.g. readahead). We get to drop zswap's rbtree which might result in better performance (fewer lookups, less lock contention).
> > > > > > > > > >
> > > > > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g. removing swapper address spaces, removing swap count continuation code, etc). Another nice cleanup that this work enables would be separating the overloaded swp_entry_t into two distinct types: one for things that are stored in page tables / caches, and one for actual swap entries. In the future, we can potentially further optimize how we use the bits in the page tables instead of sticking everything into the current type/offset format.
> > > > > > > > > >
> > > > > > > > > > Another potential win here can be swapoff, which can be more practical by directly scanning all swap_desc's instead of going through page tables and shmem page caches.
> > > > > > > > > >
> > > > > > > > > > Overall zswap becomes more accessible and available to a wider range of use cases.
> > > > > > > > >
> > > > > > > > > How will you handle zswap writeback? Zswap may write back to the backing swap device IIUC. Assuming you have both zswap and swapfile, they are separate devices with this design, right? If so, is the swapfile still the writeback target of zswap? And if it is the writeback target, what if the swapfile is full?
> > > > > > > >
> > > > > > > > When we try to write back from zswap, we try to allocate a swap slot in the swapfile, and switch the swap_desc to point to that instead. The process would be transparent to the rest of MM (page tables, page cache, etc). If the swapfile is full, then there's really nothing we can do, reclaim fails and we start OOMing. I imagine this is the same behavior as today when swap is full, the difference would be that we have to fill both zswap AND the swapfile to get to the OOMing point, so an overall increased swapping capacity.
> > > > > > >
> > > > > > > When zswap is full, but the swapfile is not yet, will swap try to write back from zswap to the swapfile to make more room for zswap, or just swap out to the swapfile directly?
> > > > > >
> > > > > > The current behavior is that we swap to the swapfile directly in this case, which is far from ideal as we break LRU ordering by skipping zswap. I believe this should be addressed, but not as part of this effort. The work to make zswap respect the LRU ordering by writing back from zswap to make room can be done orthogonally to this effort. I believe Johannes was looking into this at some point.
> > >
> > > Actually, zswap already does LRU writeback when the pool is full. Nhat Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so as of today all backends support this.
> > >
> > > There are still a few quirks in zswap that can cause rejections which bypass the LRU and need fixing. But for the most part LRU writeback to the backing file is the default behavior.
> >
> > Right, I was specifically talking about this case. When zswap is full it rejects incoming pages and they go directly to the swapfile, but we also kick off writeback, so this only happens until we do some LRU writeback. I guess I should have been clearer here. Thanks for clarifying and correcting.
> > > > >
> > > > > Other than breaking LRU ordering, I'm also concerned about the potentially deteriorating performance when writing/reading from the swapfile when zswap is full. The zswap->swapfile order should be able to maintain consistent performance for userspace.
> > > >
> > > > Right. This happens today anyway AFAICT, when zswap is full we just fall back to writing to the swapfile, so this would not be a behavior change. I agree it should be addressed anyway.
> > > > >
> > > > > But anyway I don't have the data from real life workloads to back the above points. If you or Johannes could share some real data, that would be very helpful to make the decisions.
> > > >
> > > > I actually don't, since we mostly run zswap without a backing swapfile. Perhaps Johannes might have some data on this (or anyone using zswap with a backing swapfile).
> > >
> > > Due to LRU writeback, the latency increase when zswap spills its coldest entries into backing swap is fairly linear, as you may expect. We have some limited production data on this from the webservers.
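(For completeness, the mechanical side of that spilling with the swap_desc approach, i.e. the zswap -> swapfile hand-off I described further up, would boil down to something like the below. The helper names are invented and locking is completely elided; the point is only that page tables and the page cache keep holding the same swap id, so nothing outside the swap layer notices the move.)

static int swap_desc_writeback(struct swap_desc *desc)
{
	swp_entry_t slot;

	/* Find room in the backing swapfile; if it is full, give up. */
	if (!swap_slot_alloc(&slot))		/* invented helper */
		return -ENOMEM;

	/* ... decompress the zswap copy and submit the write to 'slot' ... */

	/* Drop the zswap copy and repoint the descriptor (locking elided). */
	zswap_entry_free(desc->zswap);		/* invented helper */
	desc->slot = slot;
	clear_bit(SWAP_DESC_ZSWAP, &desc->flags);

	return 0;
}

If the slot allocation fails because the swapfile is full, we are in the same place we are today when swap fills up: reclaim fails and we head toward OOM.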
> > >
> > > The biggest challenge in this space is properly sizing the zswap pool, such that it's big enough to hold the warm set that the workload is most latency-sensitive to, yet small enough such that the cold pages get spilled to backing swap. Nhat is working on improving this.
> > >
> > > That said, I think this discussion is orthogonal to the proposed topic. zswap spills to backing swap in LRU order as of today. The LRU/pool size tweaking is an optimization to get smarter zswap/swap placement according to access frequency. The proposed swap descriptor is an optimization to get better disk utilization, the ability to run zswap without backing swap, and a dramatic speedup in swapoff time.
> >
> > Fully agree.
> > > > >
> > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > > > > >
> > > > > > > > Great! Looking forward to discussing this more!
> > > > > > > > > >
> > > > > > > > > > ==================== Cost ====================
> > > > > > > > > > The obvious downside of this is added memory overhead, specifically for users that use swapfiles without zswap. Instead of paying one byte (swap_map) for every potential page in the swapfile (+ swap count continuation), we pay the size of the swap_desc for every page that is actually in the swapfile, which I estimate to be roughly 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only scales with pages actually swapped out. For zswap users, it should be a win (or at least even) because we get to drop a lot of fields from struct zswap_entry (e.g. rbtree, index, etc).
> > >
> > > Shifting the cost from O(swapspace) to O(swapped) could be a win for many regular swap users too.
> > >
> > > There are the legacy setups that provision 2*RAM worth of swap as an emergency overflow that is then rarely used.
> > >
> > > We have setups that swap to disk more proactively, but we also overprovision those in terms of swap space due to the cliff behavior when swap fills up and the VM runs out of options.
> > >
> > > To make a fair comparison, you really have to take average swap utilization into account. And I doubt that's very high.
> >
> > Yeah, I was looking for some data here, but it varies heavily based on the use case, so I opted to only state the overhead of the swap descriptor without directly comparing it to the current overhead.
> > >
> > > In terms of worst-case behavior, +0.8% per swapped page doesn't sound like a show-stopper to me. Especially when compared to zswap's current O(swapped) waste of disk space.
> >
> > Yeah, for zswap users this should be a win on most/all fronts, even memory overhead, as we will end up trimming struct zswap_entry which is also O(swapped) memory overhead. It should also make zswap available for more use cases: you don't need to provision and configure swap space, you just need to turn zswap on.
> > > > > > > > > >
> > > > > > > > > > Another potential concern is readahead. With this design, we have no way to get a swap_desc given a swap entry (type & offset). We would need to maintain a reverse mapping, adding a little bit more overhead, or search all swapped out pages instead :). A reverse mapping might pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out memory).
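(And for the readahead concern above: the reverse mapping I had in mind is not much more than something like the below, an xarray keyed by the swapfile entry. Purely illustrative, nothing implemented. The extra pointer plus xarray overhead is what pushes the estimate from ~24 bytes to ~32 bytes per swapped page, i.e. roughly 32/4096 ~= 0.8% of swapped out memory.)

static DEFINE_XARRAY(swap_slot_to_desc);

/* Called when a swap_desc gets (or changes) a backing swapfile slot. */
static int swap_desc_link_slot(struct swap_desc *desc, swp_entry_t slot)
{
	return xa_err(xa_store(&swap_slot_to_desc, slot.val, desc, GFP_KERNEL));
}

/* Used by paths like swapin readahead that only have the on-disk entry. */
static struct swap_desc *swap_desc_lookup_slot(swp_entry_t slot)
{
	return xa_load(&swap_slot_to_desc, slot.val);
}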
A reverse mapping might > > > > > > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out > > > > > > > > > > memory). > > > > > > > > > > > > > > > > > > > > ==================== Bottom Line ==================== > > > > > > > > > > It would be nice to discuss the potential here and the tradeoffs. I > > > > > > > > > > know that other folks using zswap (or interested in using it) may find > > > > > > > > > > this very useful. I am sure I am missing some context on why things > > > > > > > > > > are the way they are, and perhaps some obvious holes in my story. > > > > > > > > > > Looking forward to discussing this with anyone interested :) > > > > > > > > > > > > > > > > > > > > I think Johannes may be interested in attending this discussion, since > > > > > > > > > > a lot of ideas here are inspired by discussions I had with him :) > > Hi everyone, > > I came across this interesting proposal and I would like to > participate in the discussion. I think it will be useful/overlap with > some projects we are currently planning in Android. Great to see more interested folks! Looking forward to discussing that! > > Thanks, > Kalesh > > > > > > > Thanks! > >