Chris Li <chrisl@xxxxxxxxxx> writes:

> On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
>> My main concern here would be having two separate swap counting
>> implementations -- although it might not be the end of the world. It
>> would be useful to consider all the options. So far, I think we have
>
> Agree.
>
>> been discussing 3 alternatives:
>>
>> (a) The initial swap_desc proposal.
>> (b) Add an optional indirection layer that can move swap entries
>> between swap devices and add a virtual swap device for zswap in the
>> kernel.
>
> For completeness' sake, let me add some options that have both pros
> and cons.
>
> (d) There is Google's ghost swap file. I understand it means a bit
>     of an ABI change. It has the advantage that it allows more than
>     one zswap swapfile. Google uses it that way. Another consideration
>     is that the ghost swap file is compatible with existing swapon
>     behavior. You can see how many swap entries are used from the
>     swapon summary. Some applications might depend on that.
>
>     We might be able to find some way to break the ABI less.
>
>> (c) Add an optional indirection layer that can move entries between
>> different swap backends. Swap backends would be zswap & swap devices
>> for now. Zswap needs to implement swap entry management, swap
>> counting, etc.
>
> (f) I have been thinking of variants of (b) without adding a virtual
>     swap device for zswap, using the ghost swap file instead.
>
>     Also, the indirection is optional per swap entry at run time.
>     Some swap devices can have some entries moved to another swap
>     device. Only those swap entries pay the price of the indirection
>     layer.
>
> (e) This is the long term goal I have in mind. A VFS-like
>     implementation for swap files. Let's call it VSW.
>     This allows different swap devices to use different
>     swap file system implementations.

I like this too!
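
To make the VSW idea in (e) a bit more concrete, here is a rough
user-space sketch in C of a per-backend operations table, similar in
spirit to how VFS dispatches through ops structs. All names in it
(struct vsw_ops, struct vsw_dev, struct vsw_entry, vsw_move(), the two
toy backends) are made up for illustration and are not existing kernel
code. The "inverted" backend only stands in for a backend that keeps
data in a different format, and vsw_move() is one possible reading of
"move an entry to another swap device", not a proposed design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VSW_PAGE_SIZE 4096
#define VSW_SLOTS     8

struct vsw_dev;

/* Hypothetical per-backend operations, analogous to a VFS ops table. */
struct vsw_ops {
	const char *name;
	int  (*store)(struct vsw_dev *dev, size_t slot, const void *page);
	int  (*load)(struct vsw_dev *dev, size_t slot, void *page);
	void (*free_slot)(struct vsw_dev *dev, size_t slot);
};

/* Toy device: a few page-sized slots held in memory. */
struct vsw_dev {
	const struct vsw_ops *ops;
	unsigned char data[VSW_SLOTS][VSW_PAGE_SIZE];
	unsigned char used[VSW_SLOTS];
};

/* A swap entry records which device and slot currently hold the page. */
struct vsw_entry {
	struct vsw_dev *dev;
	size_t slot;
};

/* Backend 1: stores the page bytes unchanged. */
static int plain_store(struct vsw_dev *dev, size_t slot, const void *page)
{
	if (slot >= VSW_SLOTS || dev->used[slot])
		return -1;
	memcpy(dev->data[slot], page, VSW_PAGE_SIZE);
	dev->used[slot] = 1;
	return 0;
}

static int plain_load(struct vsw_dev *dev, size_t slot, void *page)
{
	if (slot >= VSW_SLOTS || !dev->used[slot])
		return -1;
	memcpy(page, dev->data[slot], VSW_PAGE_SIZE);
	return 0;
}

/* Backend 2: stores every byte inverted (stand-in for another format). */
static int inverted_store(struct vsw_dev *dev, size_t slot, const void *page)
{
	const unsigned char *src = page;
	size_t i;

	if (slot >= VSW_SLOTS || dev->used[slot])
		return -1;
	for (i = 0; i < VSW_PAGE_SIZE; i++)
		dev->data[slot][i] = (unsigned char)~src[i];
	dev->used[slot] = 1;
	return 0;
}

static int inverted_load(struct vsw_dev *dev, size_t slot, void *page)
{
	unsigned char *dst = page;
	size_t i;

	if (slot >= VSW_SLOTS || !dev->used[slot])
		return -1;
	for (i = 0; i < VSW_PAGE_SIZE; i++)
		dst[i] = (unsigned char)~dev->data[slot][i];
	return 0;
}

/* Both toy backends release a slot the same way. */
static void common_free_slot(struct vsw_dev *dev, size_t slot)
{
	if (slot < VSW_SLOTS)
		dev->used[slot] = 0;
}

static const struct vsw_ops vsw_plain_ops = {
	"plain", plain_store, plain_load, common_free_slot
};
static const struct vsw_ops vsw_inverted_ops = {
	"inverted", inverted_store, inverted_load, common_free_slot
};

/* Swap out: store via the backend and record the location in the entry. */
static int vsw_swap_out(struct vsw_entry *e, struct vsw_dev *dev,
			size_t slot, const void *page)
{
	int ret = dev->ops->store(dev, slot, page);

	if (ret == 0) {
		e->dev = dev;
		e->slot = slot;
	}
	return ret;
}

/* Swap in: the caller never needs to know which backend holds the page. */
static int vsw_swap_in(const struct vsw_entry *e, void *page)
{
	assert(e->dev != NULL);
	return e->dev->ops->load(e->dev, e->slot, page);
}

/* Move one entry to another device; the source is kept on failure. */
static int vsw_move(struct vsw_entry *e, struct vsw_dev *dst, size_t dst_slot)
{
	unsigned char buf[VSW_PAGE_SIZE];

	if (e->dev->ops->load(e->dev, e->slot, buf))
		return -1;
	if (dst->ops->store(dst, dst_slot, buf))
		return -1;
	e->dev->ops->free_slot(e->dev, e->slot);
	e->dev = dst;
	e->slot = dst_slot;
	return 0;
}
```

In this sketch only vsw_move() touches two devices, so an entry that
is never moved pays nothing beyond the ops-table dispatch.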

> A lot of the difficult trade-offs we have right now:
> a smaller per-entry up-front allocation like swap_map[] for all
> entries vs. only allocating memory for swap entries that have been
> swapped out, but with a larger per-entry allocation.

Yes.

> I believe some of those trade-offs can be addressed by having a
> different swap file system. I do mean a different "mkswap",
> that kind of file system.

We may not need that, because the swap on-disk format needn't be
permanent across reboots.

> We can write out some of the swap
> entry metadata to the swap file system as well. It means
> we don't have to pay the larger per swap entry allocation overhead
> for very cold pages. It might need to take two reads to swap
> in some of the very cold swap entries. But that should be rare.

Sounds like a good idea. At least it can be investigated further.

> It can offer benefits for swapping out larger folios as well.
> Right now swapping out large folios still needs to go through
> the per-4k-page swap index allocation and breakdown.
>
> Basically, modernize the swap file system.
>
> The redirection layer should be implementable within VSW
> as well.
>
> I know that is a very ambitious plan :-)

Yes.

> We can do that incrementally. The swap file system doesn't need
> much backward compatibility across reboots, so it should be easier
> than a normal file system.

Agree.

Best Regards,
Huang, Ying