Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache

David Howells <dhowells@xxxxxxxxxx> · Fri, 02 Feb 2024 15:57:44 +0000

Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:

> So my modest proposal is that we completely rearchitect how we handle
> swap.  Instead of putting swp entries in the page tables (and in shmem's
> case in the page cache), we turn swap into an (object, offset) lookup
> (just like a filesystem).  That means that each anon_vma becomes its
> own swap object and each shmem inode becomes its own swap object.
> The swap system can then borrow techniques from whichever filesystem
> it likes to do (object, offset, length) -> n x (device, block) mappings.

That's basically what I'm suggesting, I think, but offloading the mechanics
down to a filesystem.  That would be fine with me.  bcachefs is a {key,val}
store, right?
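
As a rough illustration of the (object, offset, length) -> n x (device, block)
lookup being discussed, here is a minimal userspace model in Python.  It is a
sketch under stated assumptions, not kernel code and not taken from any patch:
the names SwapObject and Extent, the use of block units for offsets, and the
integer device ids are all hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Extent(NamedTuple):
    offset: int   # start offset within the object, in blocks (assumed unit)
    length: int   # number of blocks covered
    device: int   # hypothetical backing-device id
    block: int    # first block on that device


class SwapObject:
    """One swap object, e.g. one anon_vma or one shmem inode (model only)."""

    def __init__(self):
        self.extents = []  # kept sorted by offset, assumed non-overlapping

    def add_extent(self, ext):
        idx = bisect_right([e.offset for e in self.extents], ext.offset)
        self.extents.insert(idx, ext)

    def lookup(self, offset, length):
        """Map [offset, offset + length) to a list of (device, block, nblocks).

        Raises KeyError if any part of the range has no backing extent.
        """
        if length <= 0:
            raise ValueError("length must be positive")
        out = []
        pos, end = offset, offset + length
        for e in self.extents:
            if e.offset + e.length <= pos:
                continue              # extent lies entirely before the range
            if e.offset > pos:
                break                 # hole in the mapping
            n = min(e.offset + e.length, end) - pos
            out.append((e.device, e.block + (pos - e.offset), n))
            pos += n
            if pos >= end:
                return out
        raise KeyError(f"no mapping for block {pos}")
```

A single lookup that straddles two extents comes back as two (device, block,
nblocks) pieces, which is the "n x" part of the mapping.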

> > Further to this, we have at least two ways to cache data on
> > disk/flash/etc. - swap and fscache - and both want to set aside disk space
> > for their operation.  Might it be possible to combine the two?
> > 
> > One thing I want to look at for fscache is the possibility of switching
> > from a file-per-object-based approach to a tagged cache more akin to the
> > way OpenAFS does things.  In OpenAFS, you have a whole bunch of small
> > files, each containing a single block (e.g. 256K) of data, and an index
> > that maps a particular {volume,file,version,block} to one of these files
> > in the cache.
> 
> I think my proposal above works for you?  For each file you want to cache,
> create a swap object, and then tell swap when you want to read/write to
> the local swap object.  What you do need is to persist the objects over
> a power cycle.  That shouldn't be too hard ... after all, filesystems
> manage to do it.

Sure - but there is an integrity constraint that doesn't exist with swap.

There is also an additional feature of fscache: unless the cache entry is
locked in the cache (e.g. we're doing disconnected operation), we can throw
away an object from fscache and recycle it if we need space.  In fact, this is
the way OpenAFS works: every write transaction done on a file/dir on the
server is done atomically and is given a monotonically increasing data version
number that is then used as part of the index key in the cache.  So old
versions of the data get recycled as the cache needs to make space.

Which also means that if swap needs more space, it can just kick stuff out of
fscache if it is not locked in.
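
To make the recycling idea concrete, here is a small Python model of a cache
indexed by {volume,file,version,block} in which unlocked entries can be
reclaimed on demand.  This is only a sketch: the VersionedCache class, its
method names and the least-recently-used eviction order are assumptions made
for illustration, not how OpenAFS or fscache actually implement it.

```python
from collections import OrderedDict


class VersionedCache:
    """Toy cache keyed by (volume, file, data_version, block)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (data, locked), oldest first

    def store(self, volume, file, version, block, data, locked=False):
        key = (volume, file, version, block)
        self.entries.pop(key, None)
        while len(self.entries) >= self.capacity:
            if not self.reclaim(1):
                raise RuntimeError("cache is full of locked entries")
        self.entries[key] = (data, locked)

    def lookup(self, volume, file, version, block):
        key = (volume, file, version, block)
        entry = self.entries.get(key)
        if entry is None:
            return None               # includes lookups for a stale version
        self.entries.move_to_end(key)
        return entry[0]

    def reclaim(self, count):
        """Evict up to `count` unlocked entries, oldest first.

        Returns the number evicted.  A caller such as swap could use this
        to take space back from the cache.
        """
        victims = [k for k, (_, locked) in self.entries.items() if not locked]
        victims = victims[:count]
        for k in victims:
            del self.entries[k]
        return len(victims)
```

Because the data version is part of the key, a newer version never collides
with an older one.  The old entry just stops being looked up and ages out
when space is needed.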

> All we need to do is figure out how to name the lookup (I don't think we
> need to use strings to name the swap object, but obviously we could).  Maybe
> it's just a stream of bytes.

A binary blob would probably be better.

I would use a separate index to map higher-level organisations, such as
cell+volume in afs or the server address + share name in cifs, to an index
number that can be used in the cache.

Further, I could do with a way to invalidate all objects matching a particular
subkey.
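
One way to picture the binary-blob key, the separate index and invalidation
by subkey is the Python sketch below.  Everything in it is an assumption for
illustration: the BlobIndex class, the field layout (a 32-bit index number
followed by 64-bit file id and block number) and the choice of a sorted key
list.  Big-endian packing is used so that keys sort in field order and a
subkey is simply a byte prefix.

```python
import struct
from bisect import bisect_left


class BlobIndex:
    """Toy index: binary keys, with invalidation by key prefix."""

    def __init__(self):
        self.volumes = {}   # (cell, volume) -> small index number
        self.keys = []      # sorted list of blob keys
        self.values = {}    # blob key -> cached value

    def volume_index(self, cell, volume):
        # Separate index mapping the higher-level name to a number.
        return self.volumes.setdefault((cell, volume), len(self.volumes))

    def make_key(self, cell, volume, file_id, block):
        # Hypothetical layout: u32 index, u64 file id, u64 block number.
        idx = self.volume_index(cell, volume)
        return struct.pack(">IQQ", idx, file_id, block)

    def insert(self, key, value):
        if key not in self.values:
            self.keys.insert(bisect_left(self.keys, key), key)
        self.values[key] = value

    def invalidate_prefix(self, prefix):
        """Drop every entry whose key starts with `prefix`; return the count."""
        lo = bisect_left(self.keys, prefix)
        hi = lo
        while hi < len(self.keys) and self.keys[hi].startswith(prefix):
            hi += 1
        for k in self.keys[lo:hi]:
            del self.values[k]
        del self.keys[lo:hi]
        return hi - lo
```

Invalidating a whole volume is then a matter of packing just its index number
and passing that as the prefix.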

David