On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
> My main concern here would be having two separate swap counting
> implementations -- although it might not be the end of the world. It
> would be useful to consider all the options. So far, I think we have

Agree.

> been discussing 3 alternatives:
>
> (a) The initial swap_desc proposal.
> (b) Add an optional indirection layer that can move swap entries
> between swap devices and add a virtual swap device for zswap in the
> kernel.

For completeness' sake, let me add some options that have both pros
and cons.

(d) There is Google's ghost swap file. I understand it means a bit of
an ABI change. It has the advantage that it allows more than one zswap
swapfile. Google uses it that way. Another consideration is that the
ghost swap file is compatible with the existing swapon behavior. You
can see how many swap entries are in use from the swapon summary. Some
applications might depend on that. We might be able to find some way
to break the ABI less.

> (c) Add an optional indirection layer that can move entries between
> different swap backends. Swap backends would be zswap & swap devices
> for now. Zswap needs to implement swap entry management, swap
> counting, etc.

(f) I have been thinking of variants of (b) without adding a virtual
swap device for zswap, using the ghost swap file instead. Also, the
indirection is optional per swap entry at run time. Some swap devices
can have some entries moved to another swap device. Only those swap
entries pay the price of the indirection layer.

(e) This is the long term goal I have in mind: a VFS-like
implementation for swap files. Let's call it VSW. This allows
different swap devices to use different swap file system
implementations. We face a lot of difficult trade-offs right now, for
example: a smaller per-entry up-front allocation like swap_map[] for
every entry, vs. allocating memory only for swap entries that have
been swapped out, but with a larger per-entry allocation.
I believe some of those trade-offs can be addressed by having a
different swap file system. I do mean a different "mkswap", that kind
of file system. We can write some of the swap entry metadata out to
the swap file system as well. That means we don't have to pay the
larger per-entry allocation overhead for very cold pages. It might
take two reads to swap in some of the very cold swap entries, but that
should be rare.

It can offer benefits for swapping out larger folios as well. Right
now, swapping out large folios still needs to go through the per-4k
page swap index allocation and break down.

Basically, modernize the swap file system. It should be possible to
implement the redirection layer within VSW as well.

I know that is a very ambitious plan :-) We can do it incrementally.
The swap file system doesn't have much backward compatibility burden
across reboots, so it should be easier than a normal file system.

Chris
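
To make the "only moved entries pay for the indirection" idea in (f) a
bit more concrete, here is a minimal userspace sketch. It is not
kernel code: every name in it (vsw_loc, vsw_move, vsw_resolve,
VSW_SLOT_MOVED, the table sizes) is hypothetical and made up for
illustration. It assumes one spare flag per slot, standing in for a
bit that would have to be found in something like swap_map[], and a
tiny fixed-size redirect table in place of a real lookup structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model: none of these names exist in the kernel. */
#define VSW_NR_DEVS        2
#define VSW_NR_SLOTS       16
#define VSW_MAX_REDIRECTS  4
#define VSW_SLOT_MOVED     0x1	/* per-slot flag: entry now lives elsewhere */

struct vsw_loc {
	unsigned int dev;	/* swap device index */
	unsigned int slot;	/* slot offset on that device */
};

struct vsw_redirect {
	int used;
	struct vsw_loc from, to;
};

/* One flag byte per slot, standing in for a spare per-entry bit. */
static uint8_t vsw_flags[VSW_NR_DEVS][VSW_NR_SLOTS];
static struct vsw_redirect vsw_table[VSW_MAX_REDIRECTS];

static int vsw_valid(struct vsw_loc l)
{
	return l.dev < VSW_NR_DEVS && l.slot < VSW_NR_SLOTS;
}

static int vsw_same(struct vsw_loc a, struct vsw_loc b)
{
	return a.dev == b.dev && a.slot == b.slot;
}

/*
 * Record that the entry at 'from' now lives at 'to'.
 * Returns 0 on success, -1 on a bad location or a full table.
 */
static int vsw_move(struct vsw_loc from, struct vsw_loc to)
{
	size_t i;

	if (!vsw_valid(from) || !vsw_valid(to))
		return -1;

	/* Entry already redirected: just update its destination. */
	for (i = 0; i < VSW_MAX_REDIRECTS; i++) {
		if (vsw_table[i].used && vsw_same(vsw_table[i].from, from)) {
			vsw_table[i].to = to;
			return 0;
		}
	}

	for (i = 0; i < VSW_MAX_REDIRECTS; i++) {
		if (!vsw_table[i].used) {
			vsw_table[i].used = 1;
			vsw_table[i].from = from;
			vsw_table[i].to = to;
			vsw_flags[from.dev][from.slot] |= VSW_SLOT_MOVED;
			return 0;
		}
	}
	return -1;
}

/* Resolve an entry to its current location. */
static struct vsw_loc vsw_resolve(struct vsw_loc loc)
{
	size_t i;

	/* Fast path: entries that never moved skip the table entirely. */
	if (!vsw_valid(loc) ||
	    !(vsw_flags[loc.dev][loc.slot] & VSW_SLOT_MOVED))
		return loc;

	for (i = 0; i < VSW_MAX_REDIRECTS; i++) {
		if (vsw_table[i].used && vsw_same(vsw_table[i].from, loc))
			return vsw_table[i].to;
	}
	return loc;	/* only reached if flag and table disagree */
}

/* Usage example: move one cold entry, leave its neighbour alone. */
static void vsw_example(void)
{
	struct vsw_loc cold = { 0, 3 }, other = { 0, 5 }, dst = { 1, 7 };
	struct vsw_loc r;

	assert(vsw_move(cold, dst) == 0);

	r = vsw_resolve(cold);		/* follows the redirect */
	assert(r.dev == 1 && r.slot == 7);

	r = vsw_resolve(other);		/* untouched: resolves to itself */
	assert(r.dev == 0 && r.slot == 5);
}
```

The point of the per-slot flag is that vsw_resolve() returns on the
first check for every entry that was never moved, so the table lookup
cost is only paid by redirected entries. A real implementation would
need locking, a scalable lookup structure, and a way to retire
redirects; none of that is modelled here.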