Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

Chris Li <chrisl@xxxxxxxxxx> · Wed, 29 Mar 2023 09:04:20 -0700

On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
> My main concern here would be having two separate swap counting
> implementations -- although it might not be the end of the world. It
> would be useful to consider all the options. So far, I think we have

Agree.

> been discussing 3 alternatives:
> 
> (a) The initial swap_desc proposal.
> (b) Add an optional indirection layer that can move swap entries
> between swap devices and add a virtual swap device for zswap in the
> kernel.

For completeness' sake, let me add some options that have both pros
and cons.

(d) There is Google's ghost swap file. I understand it means a bit
of an ABI change. It has the advantage that it allows more than one
zswap swapfile. Google uses it that way. Another consideration is
that the ghost swap file is compatible with existing swapon behavior.
You can see how many swap entries are in use from the swapon summary.
Some applications might depend on that.

We might be able to find some way to break the ABI less.

> (c) Add an optional indirection layer that can move entries between
> different swap backends. Swap backends would be zswap & swap devices
> for now. Zswap needs to implement swap entry management, swap
> counting, etc.

(f) I have been thinking of variants of (b) without adding a virtual
swap device for zswap, using the ghost swap file instead.

Also, the indirection is optional per swap entry at run time.
A swap device can have some of its entries moved to another swap
device. Only those swap entries pay the price of the indirection layer.
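
To make the "only redirected entries pay" point concrete, here is a
rough userspace sketch. It is not taken from any patch. The flag bit,
the table, and every name in it (SWP_REDIRECTED, redirect_entry,
resolve_entry) are hypothetical and only illustrate the idea.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle type: either a direct location or a redirected one. */
typedef uint64_t swp_handle_t;

/* Assumption for illustration: the top bit marks a redirected entry. */
#define SWP_REDIRECTED (UINT64_C(1) << 63)
#define REDIR_SLOTS 16

/* Only entries that have been moved occupy a slot in this table. */
static swp_handle_t redir_table[REDIR_SLOTS];
static size_t redir_used;

/*
 * Record that an entry now lives at new_location and hand back a
 * redirected handle through *out. Fails if the table is full or if
 * new_location itself carries the redirect bit.
 */
static bool redirect_entry(swp_handle_t new_location, swp_handle_t *out)
{
	if ((new_location & SWP_REDIRECTED) || redir_used == REDIR_SLOTS)
		return false;
	redir_table[redir_used] = new_location;
	*out = SWP_REDIRECTED | (swp_handle_t)redir_used;
	redir_used++;
	return true;
}

/*
 * Resolve a handle to its current location. Direct entries return
 * immediately; only redirected entries take the table lookup.
 * The handle is assumed to come from redirect_entry().
 */
static swp_handle_t resolve_entry(swp_handle_t entry)
{
	if (!(entry & SWP_REDIRECTED))
		return entry;
	return redir_table[entry & ~SWP_REDIRECTED];
}
```

An entry that never moves is resolved by the first branch alone, so
the table memory and the extra lookup scale with the number of moved
entries rather than with the size of the device.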

(e) This is the long-term goal I have in mind: a VFS-like
implementation for swap files. Let's call it VSW.
This allows different swap devices to use different
swap file system implementations.
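
As a rough sketch of what "VFS-like" could mean here, each swap file
system format would supply its own operations table and the generic
layer would only dispatch through it. Everything below (struct
vsw_ops, struct vsw_device, the toy "bump" format) is hypothetical
and written for userspace, not a proposed kernel interface.

```c
#include <stddef.h>

struct vsw_device;

/* Hypothetical per-format operations, in the spirit of VFS ops tables. */
struct vsw_ops {
	const char *name;
	long (*alloc_slot)(struct vsw_device *dev);	/* slot index, or -1 */
	void (*free_slot)(struct vsw_device *dev, long slot);
};

struct vsw_device {
	const struct vsw_ops *ops;	/* chosen when the device is set up */
	long nr_slots;
	long next_free;			/* state used by the toy format below */
};

/* Toy format: hands out slots in order and never recycles them. */
static long bump_alloc(struct vsw_device *dev)
{
	if (dev->next_free >= dev->nr_slots)
		return -1;
	return dev->next_free++;
}

static void bump_free(struct vsw_device *dev, long slot)
{
	/* A real format would recycle the slot; the toy one ignores it. */
	(void)dev;
	(void)slot;
}

static const struct vsw_ops bump_ops = {
	.name = "bump",
	.alloc_slot = bump_alloc,
	.free_slot = bump_free,
};

/* Generic layer: callers never need to know which format is in use. */
static long vsw_alloc(struct vsw_device *dev)
{
	return dev->ops->alloc_slot(dev);
}

static void vsw_free(struct vsw_device *dev, long slot)
{
	dev->ops->free_slot(dev, slot);
}
```

A second format with a different metadata layout would only need its
own ops table. The callers of vsw_alloc() and vsw_free() stay the same.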

A lot of the difficult trade-offs we have right now look like this:
a smaller per-entry up-front allocation, like swap_map[], for every
entry, versus allocating memory only for swap entries that have been
swapped out, but with a larger per-entry allocation.
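
Some back-of-the-envelope arithmetic for that trade-off, as a small
sketch. The one byte per slot matches swap_map[] alone. The 32 bytes
per in-use entry for the on-demand scheme is purely an assumed number
for illustration, as are the function names.

```c
#include <stdint.h>

/* Up-front scheme: every slot on the device costs small_bytes. */
static uint64_t upfront_bytes(uint64_t nr_slots, uint64_t small_bytes)
{
	return nr_slots * small_bytes;
}

/* On-demand scheme: only slots actually in use cost large_bytes. */
static uint64_t ondemand_bytes(uint64_t used_slots, uint64_t large_bytes)
{
	return used_slots * large_bytes;
}

/*
 * Number of in-use slots at which both schemes cost the same.
 * Below this, on-demand allocation uses less memory.
 */
static uint64_t breakeven_used_slots(uint64_t nr_slots,
				     uint64_t small_bytes,
				     uint64_t large_bytes)
{
	return nr_slots * small_bytes / large_bytes;
}
```

For a 1 TiB swap device with 4 KiB pages there are 2^28 slots, so
the up-front scheme costs 256 MiB at one byte per slot. With the
assumed 32 bytes per in-use entry, the on-demand scheme is cheaper
until 2^23 slots (32 GiB worth of pages) are in use.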

I believe some of those trade-offs can be addressed by having a
different swap file system. I do mean a different "mkswap"
kind of file system. We can write some of the swap entry
metadata out to the swap file system as well. That means
we don't have to pay the larger per-swap-entry allocation overhead
for very cold pages. It might take two reads to swap
in some of the very cold swap entries, but that should be rare.

It can offer benefits for swapping out larger folios as well.
Right now, swapping out large folios still needs to go through
the per-4k page swap index allocation and breakdown.

Basically, modernize the swap file system.

It should be possible to implement the redirection layer within
VSW as well.

I know that is a very ambitious plan :-)

We can do that incrementally. A swap file system doesn't have
much of a backward compatibility requirement across reboots, so
it should be easier than a normal file system.

Chris