Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

On Tue, Mar 5, 2024 at 8:17 PM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
> > >
> > > What stops us from using an existing compressing filesystem?
> >
> > The issue is that swap has a lot of usage that differs from a typical
> > file system. Please take a look at the different usage cases of swap
> > and their related data structures at the beginning of this email
> > thread.  If you want to use an existing file system, you still need
> > to bridge the gap between the swap system and file systems. For
> > example, cgroup information is associated with each swap entry.
> >
> > You can think of swap as a special file system that can read and
> > write 4K objects by key.  You can always use file system extended
> > attributes to track the additional information associated with each
> > swap entry.
>
> Yes.  This is what I was trying to say.  While the swap dev pretends to
> be just a simple index, your opener for this thread mentions a VFS-like
> swap interface.  What exactly is the interface you have in mind?  If
> it's VFS-like... how does it differ?

Let me clarify what I mean by VFS-like.  I want the swap device to
expose some common swap-related operations through VFS-like callback
functions, so that different implementations of swap devices can live
together. For example, the classic swap and the swap cluster should be
able to register as two different swap back end implementations using a
common set of operation functions.

I do not mean to borrow the VFS operation interface as it is. The swap
back end requirements are very different from those of a typical file
system.
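
To make that more concrete, here is a rough sketch of the kind of ops
table I have in mind. The struct and function names below are made up
purely for illustration, not a proposal for the actual API:

/*
 * Hypothetical swap back end ops table, for illustration only.
 * "struct swap_backend" does not exist in the kernel today.
 */
struct swap_backend_ops {
	int	(*alloc_entry)(struct swap_backend *sb, swp_entry_t *entry);
	void	(*free_entry)(struct swap_backend *sb, swp_entry_t entry);
	int	(*read_folio)(struct swap_backend *sb, swp_entry_t entry,
			      struct folio *folio);
	int	(*write_folio)(struct swap_backend *sb, swp_entry_t entry,
			       struct folio *folio);
};

/*
 * The classic swapfile code and a cluster-based back end would each
 * register their own ops table, much like file systems register with
 * the VFS, except the hooks are swap specific rather than the VFS ones.
 */
int register_swap_backend(struct swap_backend *sb,
			  const struct swap_backend_ops *ops);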

>
> > At the end of the day, using an existing file system, the per swap
> > entry metadata overhead would likely be much higher than the current
> > swap back end. I understand the current swap back end organizes the
> > data around the swap offset, which makes swap data spread across many
> > different places. That is one reason people might not like it.
> > However, it does have pretty minimal per swap entry memory overheads.
> >
> > A file system can store its metadata on disk, reducing the in-memory
> > overhead. The price is that when you swap in a page, you might need
> > to go through a few file system metadata reads before you can read
> > in the real swapped data.
>
> When I look at all the things being asked of modern swap backends
> (compression, tiering, metadata tracking, usage metrics, caching,
> backing storage), there is a lot of potential for reuse from the
> filesystem world.  If we truly have a VFS-like swap interface, why not
> make it easy to facilitate that reuse?
>
> So of course I don't think we should just take stock btrfs and call it
> a swap backing store.  When I asked "What stops us..." I meant to
> discuss it to see how far off the vision is.
>
> So let's consider the points you mentioned.
>
> Metadata overhead:
> ZRAM uses 1% of the disksize as metadata storage; you can get to 1% or
> less with unmodified modern filesystems (it depends on a lot of factors).

Consider the case where each file is 4K and you need to store millions
of such small files, referenced by an integer key as the filename.
Typical file systems like ext4 or btrfs will definitely not be able to
get to 1% metadata storage for that kind of usage. 1% of 4K is about 40
bytes, and your typical inode struct is much bigger than that. Last I
checked, sizeof(struct inode) is 632 bytes.
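
As a quick back-of-envelope check (the 8 GiB swap size below is just an
example I am picking, and it assumes one in-memory inode per 4K entry):

/* Rough comparison using the figures above; all numbers approximate. */
#include <stdio.h>

int main(void)
{
	unsigned long entries  = (8UL << 30) / 4096; /* ~2M 4K swap entries        */
	unsigned long inode_sz = 632;                /* sizeof(struct inode) above */
	unsigned long budget   = 4096 / 100;         /* 1% of 4K, ~40 bytes/entry  */

	printf("one inode per entry: %lu MiB of metadata\n",
	       entries * inode_sz >> 20);
	printf("1%% of swap size:     %lu MiB of metadata\n",
	       entries * budget >> 20);
	return 0;
}

That is roughly 1264 MiB of in-memory metadata versus an 80 MiB budget.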

> From a fundamental architecture standpoint it's not a stretch to think
> that a modified filesystem could meet or beat existing swap engines
> on metadata overhead.

Please show me one file system that can beat the existing swap system
in the swap-specific usage case (load/store of individual 4K pages); I
am interested in learning.

>
> Too many disk ops:
> This is a solid argument for not using most filesystems today.  But
> it's also one that is addressable: modern filesystems have lots of
> caching and separation of metadata and data.  There's no reason a
> variant can't be made that will not store metadata to disk.

That is based on the assumption that you can predict the next IO from
previous IO, which holds especially for streaming file access.
Swap access is typically more random.

>
> In the traditional VFS space, fragmentation and allocation are the
> responsibility of the filesystem, not the pagecache or VFS layer (okay,
> it gets complicated in the corner cases).  If we call swap backends
> the swap filesystems, then I don't think it's hard to imagine a
> modified (or a new) filesystem could be rather easily adapted to
> handle many of the things you're looking for, if we made a swapping
> VFS-like interface that was truly a clean subset of the VFS
> interface.
>
> With a whole family of specialized swap filesystems optimized for
> different systems and media types, you could do buddy allocation,
> larger writes, LRU-level group allocations, sub-page allocation,
> direct writes, compression, tiering, readahead hints, deduplication,
> caching, etc. with nearly off-the-shelf code.  And all this with a free
> set of stable APIs, tools, conventions, design patterns, and
> abstractions to allow for quick and easy innovation in this space.

If you have a more concrete example of how to map an existing file
system to the current swap usage, we can discuss the size trade-offs
further.  The usage cases and constraints of swap and file systems are
very different. I believe a custom-designed swap back end will be a
better fit than borrowing an existing file system as it is.
>
> And if that seems daunting, we can start by making existing swap
> backends glue into the new VFS-like interface and punt this for later.
> But making clear and clean VFS-like interfaces, if done right, allows
> for a ton of innovation here.
>
> > >
> > > Crazy talk here.  What if we handled swap pages like they were mmap'd
> > > to a special swap "file(s)"?
> >
> > That is already the case in the kernel: the swap cache is handled the
> > same way as the file cache, keyed by an offset. Some paths even share
> > the same underlying functions, for example filemap_get_folio().
>
> Right, there is some similarity in the middle.  And yet the way a
> swapped page is handled is very different at the "ends": the PTEs /
> fault paths and the way data gets to swap media are totally different.
> Those are the parts I was thinking about.  In other words, why do a
> VFS-like interface, why not use the VFS interface?
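
(On the swap cache point quoted above, what I meant is roughly the
following, in simplified form. This is not the exact kernel code and
the wrapper function is made up, but swap_address_space(), swp_offset()
and filemap_get_folio() are the real helpers.)

static struct folio *swap_cache_lookup(swp_entry_t entry)
{
	/* per swap type address_space, just like a file's mapping */
	struct address_space *mapping = swap_address_space(entry);
	/* the swap offset plays the role of the file offset */
	pgoff_t index = swp_offset(entry);

	/* same lookup helper the file cache uses */
	return filemap_get_folio(mapping, index);
}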

Again, I have no intention to take the existing VFS interface and graft
it onto the swap back end. I have yet to be convinced that is the right
direction to go. What I mean is that you can register your own
implementation of a swap back end using a common set of operation
interfaces. The interface will be specific to swap-related operations,
not the VFS one.

> I suppose the first-level "why" gets you to something like a circular
> reference issue when allocating memory for a VFS op that could
> trigger swapping... but maybe that's addressable.  It gets crazy but I
> have a feeling the core issues are not too serious.

Don't let me discourage you, though. Feel free to give it a try and
share more detail on how you plan to do that. For example: "I can use
this xyz file system, each swap entry maps to an inode of a file, and
here is how to allocate and free a swap entry." The more detail the
better.

Chris




