Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Fri, 2 Feb 2024 19:22:11 +0000

On Fri, Feb 02, 2024 at 03:57:44PM +0000, David Howells wrote:
> Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> 
> > So my modest proposal is that we completely rearchitect how we handle
> > swap.  Instead of putting swp entries in the page tables (and in shmem's
> > case in the page cache), we turn swap into an (object, offset) lookup
> > (just like a filesystem).  That means that each anon_vma becomes its
> > own swap object and each shmem inode becomes its own swap object.
> > The swap system can then borrow techniques from whichever filesystem
> > it likes to do (object, offset, length) -> n x (device, block) mappings.
> 
> That's basically what I'm suggesting, I think, but offloading the mechanics
> down to a filesystem.  That would be fine with me.  bcachefs is an {key,val}
> store right?

Hmm.  That's not a bad idea.  So instead of having a swapfile, we
could create a swap directory on an existing filesystem.  Or if we
want to partition the drive and have a swap partition we just
mkfs.favourite that and tell it that root is the swap directory.

I think this means we do away with the swap cache?  If the page has been
brought back in, we'd be able to find it in the anon_vma's page cache
rather than having to search the global swap cache.

> > I think my proposal above works for you?  For each file you want to cache,
> > create a swap object, and then tell swap when you want to read/write to
> > the local swap object.  What you do need is to persist the objects over
> > a power cycle.  That shouldn't be too hard ... after all, filesystems
> > manage to do it.
> 
> Sure - but there is an integrity constraint that doesn't exist with swap.
> 
> There is also an additional feature of fscache: unless the cache entry is
> locked in the cache (e.g. we're doing diconnected operation), we can throw
> away an object from fscache and recycle it if we need space.  In fact, this is
> the way OpenAFS works: every write transaction done on a file/dir on the
> server is done atomically and is given a monotonically increasing data version
> number that is then used as part of the index key in the cache.  So old
> versions of the data get recycled as the cache needs to make space.
> 
> Which also means that if swap needs more space, it can just kick stuff out of
> fscache if it is not locked in.

Ah, more requirements ;-)

> > All we need to do is figure out how to name the lookup (I don't think we
> > need to use strings to name the swap object, but obviously we could).  Maybe
> > it's just a stream of bytes.
> 
> A binary blob would probably be better.
> 
> I would use a separate index to map higher level organisations, such as
> cell+volume in afs or the server address + share name in cifs to an index
> number that can be used in the cache.
> 
> Further, I could do with a way to invalidate all objects matching a particular
> subkey.

That seems to map to a directory hierarchy?

So, named swap objects for fscache; anonymous ones for anon memory?