Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

"Huang, Ying" <ying.huang@xxxxxxxxx> · Tue, 04 Apr 2023 16:24:58 +0800

Chris Li <chrisl@xxxxxxxxxx> writes:

> On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
>> My main concern here would be having two separate swap counting
>> implementations -- although it might not be the end of the world. It
>> would be useful to consider all the options. So far, I think we have
>
> Agree.
>
>> been discussing 3 alternatives:
>> 
>> (a) The initial swap_desc proposal.
>> (b) Add an optional indirection layer that can move swap entries
>> between swap devices and add a virtual swap device for zswap in the
>> kernel.
>
> For completeness' sake, let me add some options that have both pros
> and cons.
>
> (d) There is Google's ghost swap file. I understand it means a bit
> of an ABI change. It has the advantage that it allows more than one
> zswap swapfile. Google uses it that way. Another consideration is
> that the ghost swap file is compatible with existing swapon behavior.
> You can see how many swap entries are in use from the swapon summary.
> Some applications might depend on that.
>
> We might be able to find some way to break the ABI less.
>
>> (c) Add an optional indirection layer that can move entries between
>> different swap backends. Swap backends would be zswap & swap devices
>> for now. Zswap needs to implement swap entry management, swap
>> counting, etc.
>
> (f) I have been thinking of variants of (b) without adding a virtual
> swap device for zswap, using the ghost swap file instead.
>
> Also the indirection is optional per swap entry at run time.
> Some swap devices can have some entries moved to another swap device.
> Only those swap entries pay the price of the indirection layer.
>
> (e) This is the long term goal I have in mind: a VFS-like
> implementation for swap files. Let's call it VSW.
> This allows different swap devices to use different
> swap file system implementations.

I like this too!
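
To make the VFS analogy a bit more concrete, here is a rough user-space
sketch of what a per-backend operations table could look like. Every
name in it (vsw_ops, vsw_dev, the flat_* functions) is made up for
illustration and nothing here is existing kernel API. It only shows a
generic layer dispatching through function pointers, with one toy
backend that keeps a flat, preallocated usage map.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of this is kernel API. */
struct vsw_dev;

/* Per-backend operations, loosely in the spirit of VFS ops tables. */
struct vsw_ops {
	long (*alloc_slot)(struct vsw_dev *dev);            /* slot index, or -1 if full */
	void (*free_slot)(struct vsw_dev *dev, long slot);
};

struct vsw_dev {
	const struct vsw_ops *ops;
	unsigned char *map;     /* one usage byte per slot, allocated up front */
	long nr_slots;
};

/* "flat" backend: linear scan over a preallocated usage map. */
static long flat_alloc_slot(struct vsw_dev *dev)
{
	for (long i = 0; i < dev->nr_slots; i++) {
		if (dev->map[i] == 0) {
			dev->map[i] = 1;
			return i;
		}
	}
	return -1;
}

static void flat_free_slot(struct vsw_dev *dev, long slot)
{
	if (slot >= 0 && slot < dev->nr_slots)
		dev->map[slot] = 0;
}

static const struct vsw_ops flat_ops = {
	.alloc_slot = flat_alloc_slot,
	.free_slot  = flat_free_slot,
};

/* Set up a device that uses the flat backend. Returns 0 on success. */
static int vsw_flat_init(struct vsw_dev *dev, long nr_slots)
{
	dev->map = calloc((size_t)nr_slots, 1);
	if (!dev->map)
		return -1;
	dev->nr_slots = nr_slots;
	dev->ops = &flat_ops;
	return 0;
}

/* Generic layer: callers only go through the ops table. */
static long vsw_alloc(struct vsw_dev *dev)
{
	return dev->ops->alloc_slot(dev);
}

static void vsw_free(struct vsw_dev *dev, long slot)
{
	dev->ops->free_slot(dev, slot);
}
```

A second backend, for example one that allocates metadata only for
entries that are actually swapped out, would provide its own vsw_ops
and the callers of vsw_alloc()/vsw_free() would not need to change.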

> A lot of the difficult trade-offs we have right now come down to this:
> a smaller per-entry up-front allocation, like swap_map[], for all
> entries vs. allocating memory only for swap entries that have been
> swapped out, but with a larger per-entry allocation.

Yes.
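
A back-of-the-envelope sketch of that trade-off, in user-space C. It
counts only the one byte per slot that swap_map[] uses and ignores
other per-slot metadata. The descriptor size passed to
on_demand_bytes() is a free parameter, and the function names are made
up for illustration.

```c
#include <stdint.h>

/* swap_map[] is an array of unsigned char: one byte per swap slot. */
#define SWAP_MAP_BYTES_PER_SLOT 1u

/* Metadata allocated at swapon time, whether or not slots are used. */
static uint64_t upfront_bytes(uint64_t nr_slots)
{
	return nr_slots * SWAP_MAP_BYTES_PER_SLOT;
}

/* Metadata allocated only for slots that hold swapped-out pages.
 * desc_size is a hypothetical per-entry descriptor size in bytes. */
static uint64_t on_demand_bytes(uint64_t nr_used, uint64_t desc_size)
{
	return nr_used * desc_size;
}

/* Returns 1 if on-demand allocation uses strictly less memory. */
static int on_demand_is_cheaper(uint64_t nr_slots, uint64_t nr_used,
				uint64_t desc_size)
{
	return on_demand_bytes(nr_used, desc_size) < upfront_bytes(nr_slots);
}
```

With these assumptions the break-even point is a utilization of
1/desc_size. For a 64 GiB swap device with 4 KiB pages (16Mi slots)
and an assumed 32-byte descriptor, on-demand allocation is cheaper
only while less than 2 GiB is swapped out.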

> I believe some of those trade-offs can be addressed by having a
> different swap file system. I do mean a different "mkswap",
> that kind of file system.

We may not need that, because the swap on-disk format needn't be
persistent across reboots.

> We can write out some of the swap
> entry metadata to the swap file system as well. It means
> we don't have to pay the larger per swap entry allocation overhead
> for very cold pages. It might take two reads to swap
> in some of the very cold swap entries. But that should be rare.

Sounds like a good idea.  At least it can be investigated further.

> It can offer benefits for swapping out larger folios as well.
> Right now swapping out large folios still needs to go through
> the per 4k page swap index allocation and break down.
>
> Basically, modernize the swap file system.
>
> It should be possible to implement the redirection layer within VSW
> as well.
>
> I know that is a very ambitious plan :-)

Yes.

> We can do that incrementally. The swap file system doesn't need
> much backward compatibility across reboots, so it should be easier
> than a normal file system.

Agree.

Best Regards,
Huang, Ying