On Tue, Mar 5, 2024 at 12:16 AM Chengming Zhou <chengming.zhou@xxxxxxxxx> wrote:
> > We can write out zsmalloc blocks of data as it is, however there is no
> > guarantee the data in zsmalloc blocks have the same LRU order.
>
> Right, so we should choose to write out objects based on the LRU order
> in zswap, but don't decompress it, write out it directly to swap file.

Here is an idea. Since zsmalloc uses N pages as a block to store the
data, we can have a backend read the compressed data and write it out to
another zsmalloc in N-page blocks, in LRU order. Those N-page blocks are
then written out to the swap file. The zsmalloc metadata that keeps
track of each handle is converted to refer to physical locations on the
disk. That zsmalloc metadata stays in memory.

>
> >
> > It makes more sense when writing higher order > 0 swap pages. e.g
> > writing 64K pages in one buffer, then we can write out compressed data
> > as page boundary aligned and page sizes, accepting the waste on the
> > last compressed page, might not fill up the whole page.
> >
> >>
> >> Right, I also thought about this direction for some time.
> >> Apart from fewer IO, there are more advantages we can see:
> >>
> >> 1. Don't need to allocate a page when write out compressed data.
> >> This method actually has its own problem[1], by allocating a new page and
> >> put on LRU list, wait for writeback and reclaim.
> >> If we write out compressed data directly, so don't need to allocated page,
> >> these problems can be avoided.
> >
> > Does it go through swap cache at all? If not, there will be some
> > interesting synchronization issues when other races swap in the page
> > and modify it.
>
> No, right, we have to handle the races. (Maybe we can leave "shadow" entry in zswap,
> which can be used for synchronization)

I kind of wish the swap cache stored either a folio or a pointer to a
swap entry struct. At the cost of one extra pointer per swap entry, we
could have different types of swap entry struct, e.g.
zswap. The shadow would be the common part of the swap entry members.
Then zswap or fancier swap entries could allocate different types of
swap structs. That would also simplify a lot of the swap cache for-each
looping code, since there would be no need to deal with is_value() of a
swap entry. We would just need a tag to tell whether it is a folio or a
swap entry pointer.

>
> >
> >>
> >> 2. Don't need to decompress when write out compressed data.
> >
> > Yes.
> >
> >>
> >> [1] https://lore.kernel.org/all/20240209115950.3885183-1-chengming.zhou@xxxxxxxxx/
> >>
> >>>
> >>> I'm sure it'd be a big redesign, but that seems to be what we're talking
> >>> about anyway.
> >>>
> >>
> >> Yes, we need to do modifications in some parts:
> >>
> >> 1. zsmalloc: compressed objects can be migrated anytime, we need to support pinning.
> >
> > Or use a bounce buffer to read it out.
>
> Yeah, also a choice if pinning is not easy to implement :)

In the second-zsmalloc-backend idea above, the bounce buffer is more or
less required anyway, in order to compact objects of different sizes
into page-aligned blocks. That removes the pinning requirement as well.

Chris
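
P.S. To make the repacking idea above a bit more concrete, here is a
rough userspace C sketch, not kernel code. It assumes objects arrive
already sorted in LRU order, copies them through a bounce buffer into
fixed N-page blocks, and records where each handle landed in the swap
file. Every name in it (the sketch_* structs and functions, the block
size of 4 pages, the in-memory stand-in for the swap file) is made up
for illustration and is not an existing zsmalloc or zswap interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_BLOCK_PAGES 4u  /* the "N" in N-page blocks; arbitrary here */
#define SKETCH_BLOCK_SIZE  (SKETCH_PAGE_SIZE * SKETCH_BLOCK_PAGES)

/* One compressed object, handed over in LRU order (coldest first). */
struct sketch_obj {
	uint64_t handle;   /* hypothetical stand-in for a zsmalloc handle */
	const void *data;  /* compressed bytes */
	uint32_t len;      /* compressed length, at most one block */
};

/* Metadata kept in memory: where a handle's bytes ended up on disk. */
struct sketch_loc {
	uint64_t handle;
	uint64_t disk_off; /* byte offset into the swap file */
	uint32_t len;
};

/* Hypothetical callback that writes one whole block at disk_off. */
typedef void (*sketch_write_fn)(uint64_t disk_off, const void *block,
				void *ctx);

/* Stand-in "swap file": ctx is a plain memory buffer. */
void sketch_write_to_mem(uint64_t disk_off, const void *block, void *ctx)
{
	memcpy((unsigned char *)ctx + disk_off, block, SKETCH_BLOCK_SIZE);
}

/*
 * Pack objs[0..n) into block-sized chunks through a bounce buffer.
 * An object never straddles two blocks; the unused tail of a block is
 * zero-filled and accepted as waste. Fills locs[0..n) and returns the
 * number of blocks written. Not reentrant (static bounce buffer).
 */
size_t sketch_pack_lru(const struct sketch_obj *objs, size_t n,
		       struct sketch_loc *locs,
		       sketch_write_fn write_block, void *ctx)
{
	static unsigned char bounce[SKETCH_BLOCK_SIZE];
	size_t used = 0, blocks = 0;

	for (size_t i = 0; i < n; i++) {
		assert(objs[i].len <= SKETCH_BLOCK_SIZE);

		/* Does not fit in the current block: flush it, start anew. */
		if (used + objs[i].len > SKETCH_BLOCK_SIZE) {
			memset(bounce + used, 0, SKETCH_BLOCK_SIZE - used);
			write_block((uint64_t)blocks * SKETCH_BLOCK_SIZE,
				    bounce, ctx);
			blocks++;
			used = 0;
		}

		memcpy(bounce + used, objs[i].data, objs[i].len);
		locs[i].handle = objs[i].handle;
		locs[i].disk_off = (uint64_t)blocks * SKETCH_BLOCK_SIZE + used;
		locs[i].len = objs[i].len;
		used += objs[i].len;
	}

	/* Flush the last, possibly partial, block. */
	if (used > 0) {
		memset(bounce + used, 0, SKETCH_BLOCK_SIZE - used);
		write_block((uint64_t)blocks * SKETCH_BLOCK_SIZE, bounce, ctx);
		blocks++;
	}
	return blocks;
}
```

Because every object is copied out through the bounce buffer, nothing in
the source pool has to stay pinned while the block is being assembled,
which is the point made above. The price is the wasted tail of each
block and one extra copy per object.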