On Fri, Feb 02, 2024 at 03:57:44PM +0000, David Howells wrote:
> Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> > So my modest proposal is that we completely rearchitect how we handle
> > swap.  Instead of putting swp entries in the page tables (and in shmem's
> > case in the page cache), we turn swap into an (object, offset) lookup
> > (just like a filesystem).  That means that each anon_vma becomes its
> > own swap object and each shmem inode becomes its own swap object.
> > The swap system can then borrow techniques from whichever filesystem
> > it likes to do (object, offset, length) -> n x (device, block) mappings.
>
> That's basically what I'm suggesting, I think, but offloading the mechanics
> down to a filesystem.  That would be fine with me.  bcachefs is a {key,val}
> store, right?

Hmm.  That's not a bad idea.  So instead of having a swapfile, we
could create a swap directory on an existing filesystem.  Or if we want
to partition the drive and have a swap partition, we just mkfs.favourite
that and tell it that root is the swap directory.

I think this means we do away with the swap cache?  If the page has been
brought back in, we'd be able to find it in the anon_vma's page cache
rather than having to search the global swap cache.

> > I think my proposal above works for you?  For each file you want to cache,
> > create a swap object, and then tell swap when you want to read/write to
> > the local swap object.  What you do need is to persist the objects over
> > a power cycle.  That shouldn't be too hard ... after all, filesystems
> > manage to do it.
>
> Sure - but there is an integrity constraint that doesn't exist with swap.
>
> There is also an additional feature of fscache: unless the cache entry is
> locked in the cache (e.g. we're doing disconnected operation), we can throw
> away an object from fscache and recycle it if we need space.  In fact, this is
> the way OpenAFS works: every write transaction done on a file/dir on the
> server is done atomically and is given a monotonically increasing data version
> number that is then used as part of the index key in the cache.  So old
> versions of the data get recycled as the cache needs to make space.
>
> Which also means that if swap needs more space, it can just kick stuff out of
> fscache if it is not locked in.

Ah, more requirements ;-)

> > All we need to do is figure out how to name the lookup (I don't think we
> > need to use strings to name the swap object, but obviously we could).  Maybe
> > it's just a stream of bytes.
>
> A binary blob would probably be better.
>
> I would use a separate index to map higher level organisations, such as
> cell+volume in afs or the server address + share name in cifs to an index
> number that can be used in the cache.
>
> Further, I could do with a way to invalidate all objects matching a particular
> subkey.

That seems to map to a directory hierarchy?

So, named swap objects for fscache; anonymous ones for anon memory?
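
To make the (object, offset, length) -> n x (device, block) idea a bit more
concrete, here is a rough userspace sketch in Python.  Everything in it is an
illustrative assumption rather than an existing kernel interface: the names
Extent, SwapObject, SwapIndex, map_range, invalidate_prefix and evict_unlocked
are hypothetical, the keys are plain byte strings, and "matching a subkey" is
read here as a simple key-prefix match, which is only one possible
interpretation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Extent:
    """One contiguous run of pages on a backing device (hypothetical)."""
    obj_offset: int  # first page offset within the swap object
    length: int      # number of pages covered by this extent
    device: str      # backing device name
    block: int       # first block on that device


@dataclass
class SwapObject:
    """A swap object; assumed to be one per anon_vma, shmem inode or cached file."""
    key: bytes                # binary-blob name of the object
    locked: bool = False      # locked-in objects must not be evicted for space
    extents: list = field(default_factory=list)

    def map_range(self, offset, length):
        """Translate (offset, length) into a list of (device, block, npages)."""
        out = []
        end = offset + length
        for e in sorted(self.extents, key=lambda e: e.obj_offset):
            lo = max(offset, e.obj_offset)
            hi = min(end, e.obj_offset + e.length)
            if lo < hi:  # the request overlaps this extent
                out.append((e.device, e.block + (lo - e.obj_offset), hi - lo))
        return out


class SwapIndex:
    """Index of named swap objects, keyed by binary blob (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def create(self, key, locked=False):
        obj = SwapObject(key=key, locked=locked)
        self.objects[key] = obj
        return obj

    def invalidate_prefix(self, prefix):
        """Drop every object whose key starts with prefix; return the count."""
        doomed = [k for k in self.objects if k.startswith(prefix)]
        for k in doomed:
            del self.objects[k]
        return len(doomed)

    def evict_unlocked(self):
        """Reclaim space by dropping all objects that are not locked in."""
        doomed = [k for k, o in self.objects.items() if not o.locked]
        for k in doomed:
            del self.objects[k]
        return len(doomed)
```

For instance, if an object has extents Extent(0, 4, "sda2", 100) and
Extent(4, 4, "sdb1", 900), then map_range(2, 4) returns
[("sda2", 102, 2), ("sdb1", 900, 2)], i.e. one request split across two
devices.  A key such as b"vol1/fileA/v7" (again, a made-up layout) would carry
the data version as its last component, so invalidate_prefix(b"vol1/") behaves
much like removing a directory.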