Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Chris Li <chrisl@xxxxxxxxxx> · Wed, 6 Mar 2024 10:16:12 -0800

On Wed, Mar 6, 2024 at 2:39 AM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
>
> On Tue, Mar 5, 2024 at 9:51 PM Chris Li <chrisl@xxxxxxxxxx> wrote:

> > If your file size is 4K each and you need to store millions of 4K
> > small files, reference it by an integer like filename. Typical file
> > systems like ext4, btrfs will definitely not be able to get 1% meta
> > data storage for that kind of usage. 1% of 4K is 40 bytes. Your
> > typical inode struct is much bigger than that. Last I checked, the
> > sizeof(struct inode) is 632.
>
> Okay that is an interesting difference in assumptions.  I see no need
> to have file == page, I think that would be insane to have an inode
> per swap page.  You'd have one big "file" and do offsets.  Or a file
> per cgroup, etc.

Then you are back to designing your own data structure to manage how
swap entries map to offsets in a large file. The swap file is already
one large file, and it can group clusters internally as smaller large
files. Why not use the swap file directly? The VFS does not really
help; it is more of a burden to maintain all those super blocks,
directories, inodes, etc.
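
To make the "one big file plus offsets" point concrete, here is a rough
sketch of the mapping, not actual kernel code. It assumes 4K pages and a
hypothetical cluster size of 256 slots; all names (slot_to_file_offset,
SLOTS_PER_CLUSTER, etc.) are made up for illustration.

```python
PAGE_SIZE = 4096          # assumption: 4K pages
SLOTS_PER_CLUSTER = 256   # hypothetical cluster size, for illustration only


def slot_to_file_offset(slot: int) -> int:
    """Byte offset of a swap slot inside one large backing file."""
    return slot * PAGE_SIZE


def slot_to_cluster(slot: int) -> tuple[int, int]:
    """Return (cluster index, slot index within that cluster)."""
    return divmod(slot, SLOTS_PER_CLUSTER)


def cluster_byte_range(cluster: int) -> tuple[int, int]:
    """Half-open byte range [start, end) that a cluster covers in the file."""
    start = cluster * SLOTS_PER_CLUSTER * PAGE_SIZE
    return start, start + SLOTS_PER_CLUSTER * PAGE_SIZE
```

Under these assumptions the mapping is plain integer arithmetic on the
slot number, and none of it needs an inode, directory or super block.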

> Remember I'm advocating a subset of the VFS interface, learning from
> it not using it as is.

You can't really use a subset without having the other parts dragged
along. Most of the VFS operations, the op callback functions, do not
apply to swap directly anyway.
If you say VFS is just an inspiration, then that is more or less what
I had in mind earlier :-)

>
> > > From a fundamental architecture standpoint it's not a stretch to think
> > > that a modified filesystem would meet or beat existing swap engines
> > > on metadata overhead.
> >
> > Please show me one file system that can beat the existing swap system
> > in the swap specific usage case (load/store of individual 4K pages), I
> > am interested in learning.
>
> Well mind you I'm suggesting a modified filesystem and this is hard to
> compare apples to apples, but sure... here we go :)
>
> Consider an unmodified EXT4 vs ZRAM with a backing device of the same
> sizes, same hardware.
>
> Using the page cache as a bad proxy for RAM caching in the case of
> EXT4 and comparing to the ZRAM without sending anything to the backing
> store. The ZRAM is faster at reads while the EXT4 is a little faster
> at writes
>
>       | ZRAM     | EXT4     |
> -----------------------------
> read  | 4.4 GB/s | 2.5 GB/s |
> write | 643 MB/s | 658 MB/s |
>
> If you look at what happens when you talk about getting things to and
> from the disk, then ZRAM is a tiny bit faster at reads but way
> slower at writes.
>
>       | ZRAM      | EXT4      |
> -------------------------------
> read  | 1.14 GB/s | 1.10 GB/s |
> write | 82.3 MB/s |  548 MB/s |

I am more interested in the per swap entry memory overhead.
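
As a rough illustration of what I mean by overhead, using only the
numbers already quoted earlier in this thread (4K pages, a 1% metadata
budget, and sizeof(struct inode) == 632 as last checked). The helper
names are hypothetical and this is just the arithmetic, not a
measurement:

```python
PAGE_SIZE = 4096   # assumption: 4K pages
INODE_SIZE = 632   # sizeof(struct inode) as quoted earlier in this thread


def overhead_percent(meta_bytes: float, page_size: int = PAGE_SIZE) -> float:
    """Metadata bytes per page, expressed as a percentage of the page size."""
    return 100.0 * meta_bytes / page_size


def budget_bytes(percent: float, page_size: int = PAGE_SIZE) -> float:
    """How many metadata bytes per page a given percentage budget allows."""
    return page_size * percent / 100.0


if __name__ == "__main__":
    print(f"1% budget per 4K page: {budget_bytes(1.0):.2f} bytes")
    print(f"one inode per page:    {overhead_percent(INODE_SIZE):.1f}% overhead")
```

So a 1% budget is about 40 bytes per page, while one inode per page
would be roughly 15% of the page, before counting any other file system
metadata.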

Without knowing how you map the swap entry into file reads/writes, I
have no idea how to interpret those numbers in the swap back end
usage context. ZRAM is just a block device; ZRAM does not participate
in how the swap entry is allocated or freed. ZRAM does compression,
which is CPU intensive, while EXT4 doesn't, so it is understandable that
ZRAM might have lower write bandwidth. I am not sure how those numbers
translate into a prediction of how a file system based swap back end
performs.

Regards,

Chris