Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Chris Li <chrisl@xxxxxxxxxx> · Wed, 6 Mar 2024 16:46:23 -0800

On Wed, Mar 6, 2024 at 2:44 PM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
>
> On Wed, Mar 6, 2024 at 10:16 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > On Wed, Mar 6, 2024 at 2:39 AM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 5, 2024 at 9:51 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > > > If your files are 4K each and you need to store millions of such
> > > > small files, each referenced by an integer-like filename, typical
> > > > file systems like ext4 or btrfs will definitely not be able to stay
> > > > within 1% metadata overhead for that kind of usage. 1% of 4K is about
> > > > 40 bytes. Your typical inode struct is much bigger than that. Last I
> > > > checked, sizeof(struct inode) is 632.
> > >
> > > Okay that is an interesting difference in assumptions.  I see no need
> > > to have file == page, I think that would be insane to have an inode
> > > per swap page.  You'd have one big "file" and do offsets.  Or a file
> > > per cgroup, etc.
> >
> > Then you are back to designing your own data structure to manage how
> > to map swap entries into large file offsets. The swap file is one
> > large file; it can group clusters as smaller large files internally.
>
> No, that's not how I see it. I must be missing something.  From my
> perspective I am suggesting we should NOT be designing our own data
> structures to manage how to map the swap entries into large file
> offsets.

OK, you are suggesting not using file inodes for 4K swap pages, and
also not designing our own data structure to manage swap entry allocation.

>
> This is nearly identical to the database use case which has been a
> huge driver of filesystem and block subsystem optimizations over the
> years.   In practice it's not uncommon to have a dedicated filesystem
> dominated with one huge database file, a smaller transaction log, and
> some metadata files about the database.  The workload for the database
> is random reads and writes at 8K, while the log file is operated like
> a write-only ring buffer most of the time.  And filesystems have been
> designed and optimized for decades (and continue to be optimized)
> to properly place data on the media.  All the data structures and
> grouping logic are present.  Filesystems aren't just about directories
> and files.  Those are the easy parts.

Then how do you allocate swap entries using this file system or database?
More detail on how swap entries map into offsets within the large file
would help me understand what you are trying to do.
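
To make the question concrete: the simplest mapping one could imagine is
linear, with the swap offset used directly as a page index into the one
large file. The sketch below only illustrates that assumption (4K pages;
the constant and function names are made up for this example and are not
kernel APIs). It does not describe the proposal or any existing kernel code.

```python
PAGE_SIZE = 4096  # assumption: 4K pages, as in the discussion above


def swap_offset_to_file_pos(swap_offset: int) -> int:
    """Byte position of a swap slot in one large backing file (linear mapping)."""
    if swap_offset < 0:
        raise ValueError("swap offset must be non-negative")
    return swap_offset * PAGE_SIZE


def file_pos_to_swap_offset(file_pos: int) -> int:
    """Inverse mapping; only page-aligned positions correspond to a slot."""
    if file_pos < 0 or file_pos % PAGE_SIZE != 0:
        raise ValueError("file position must be a non-negative multiple of PAGE_SIZE")
    return file_pos // PAGE_SIZE
```

Anything beyond this (choosing which slot is free, grouping slots into
clusters) is the allocation policy the question above is asking about.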

>
> > Why not use the swap file directly? The VFS does not really help,
>
> I don't understand your question.  How do you have a "swap file"
> without a clearly defined API?  What am I missing?

Swap file support exists in the kernel. You can do block IO on the swap
device at a given offset. The block device API exists.  That is how
the swap back end works right now. I am not sure I understand your
question.
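
As a rough userspace analogy of "IO at a given offset" (not kernel code,
and not how the swap back end is actually implemented), positional reads
and writes against a single backing file look like this. It assumes 4K
pages and a Unix-like system; the helper names are hypothetical.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumption: 4K pages


def write_slot(fd: int, slot: int, page: bytes) -> None:
    """Write exactly one page at the byte offset implied by the slot number."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    os.pwrite(fd, page, slot * PAGE_SIZE)


def read_slot(fd: int, slot: int) -> bytes:
    """Read one page back from the same offset."""
    return os.pread(fd, PAGE_SIZE, slot * PAGE_SIZE)


def demo() -> bool:
    """Round-trip one page through slot 2 of a temporary backing file."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        page = bytes([0xAB]) * PAGE_SIZE
        write_slot(fd, 2, page)
        return read_slot(fd, 2) == page
```

No directories or per-page inodes are involved here; the slot number alone
determines where the page lives.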

Chris

>
> > it
> > is more of a burden to maintain all those superblocks, directories,
> > inodes, etc.
>
> I mean... how is the minimum required superblock different than the
> header on a swap partition?  Sure we can strip out features that
> aren't needed. What directories and inodes are you maintaining?  But
> if your swap store happened to support extra features... why does it
> matter?
>
> > > Remember I'm advocating a subset of the VFS interface, learning from
> > > it not using it as is.
> >
> > You can't really use a subset without having the other parts dragged
> > along. Most of the VFS operations, i.e. those op callback functions, do
> > not apply to swap directly anyway.
> > If you say VFS is just an inspiration, then that is more or less what
> > I had in mind earlier :-)
>
> Of course you can use a subset without having the other parts drag
> along.  That's the definition of subset, at least how I intend it.
>
> Matthew Wilcox talked about integrating zswap and swap more tightly.
> I feel like it's not clear how zswap and swap _should_ interact given
> the state of the swap-related APIs, such as they are.
>
> On the other hand there are several canonical and easy to implement
> ways to do something similar in traditional fs/vfs land.
>
> 1. A filesystem that compressed data in RAM and did writeback to
> blockdev, it would have to have a blockdev aware allocator.
> 2. A filesystem that compressed data in RAM that overlaid another
> filesystem. Would require uncompressing to do writeback (unless VFS
> was extended with cwrite() cread() )
> 3. A block dev that compressed data in RAM under a filesystem, it
> would have to have a block dev aware allocator.
>
> I'd like to talk about making this sort of thing simple and clean to
> do with swap.
>
> > >
> > > > > From a fundamental architecture standpoint it's not a stretch to think
> > > > > that a modified filesystem would meet or beat existing swap engines
> > > > > on metadata overhead.
> > > >
> > > > Please show me one file system that can beat the existing swap system
> > > > in the swap specific usage case (load/store of individual 4K pages), I
> > > > am interested in learning.
> > >
> > > Well mind you I'm suggesting a modified filesystem and this is hard to
> > > compare apples to apples, but sure... here we go :)
> > >
> > > Consider an unmodified EXT4 vs ZRAM with a backing device of the same
> > > sizes, same hardware.
> > >
> > > Using the page cache as a bad proxy for RAM caching in the case of
> > > EXT4 and comparing to the ZRAM without sending anything to the backing
> > > store. The ZRAM is faster at reads while the EXT4 is a little faster
> > > at writes.
> > >
> > >       | ZRAM     | EXT4     |
> > > -----------------------------
> > > read  | 4.4 GB/s | 2.5 GB/s |
> > > write | 643 MB/s | 658 MB/s |
> > >
> > > If you look at what happens when getting things to and from the
> > > disk, ZRAM is a tiny bit faster at reads but way slower at
> > > writes.
> > >
> > >       | ZRAM      | EXT4      |
> > > -------------------------------
> > > read  | 1.14 GB/s | 1.10 GB/s |
> > > write | 82.3 MB/s |  548 MB/s |
> >
> > I am more interested in terms of per swap entry memory overhead.
> >
> > Without knowing how you map the swap entry into file reads/writes, I
> > have no idea how to interpret those numbers in the swap back end
> > usage context. ZRAM is just a block device; ZRAM does not participate
> > in how swap entries are allocated or freed. ZRAM does compression,
> > which is CPU intensive, while EXT4 doesn't, so it is understandable
> > that ZRAM might have lower write bandwidth.   I am not sure how those
> > numbers translate into a prediction of how a file system based swap
> > back end performs.
>
> I randomly read/write to zram block dev and one large EXT4 file with
> max concurrency for my system.  If you mounted the file and the zram
> as swap devs the performance from the benchmark should transfer to
> swap operations.  How that maps to system performance....? That's a
> more complicated benchmarking question.
>
> > Regards,
> >
> > Chris
>