On Tue, May 21, 2024 at 1:43 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> Swap and file systems have very different requirements and usage
> patterns and IO patterns.

I would counter that the design requirements for a simple filesystem
and what you are proposing to support heterogeneously sized block
allocation on a block device are very similar, not very different.

The data is owned by clients, but I've done the profiling on servers
and Android. As I've stated before, databases have reasonably close
usage and IO patterns. Swap usage of block devices is not a
particularly odd usage profile.

> One challenging aspect is that the current swap back end has a very
> low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> byte (swap cgroup), 8 byte (swap cache pointer). The inode struct is
> more than 64 bytes per file. That is a big jump if you map a swap
> entry to a file. If you map more than one swap entry to a file, then
> you need to track the mapping of file offset to swap entry, and the
> reverse lookup of swap entry to a file with offset. Whichever way you
> cut it, it will significantly increase the per swap entry memory
> overhead.

No, it won't, because the suggestion is NOT to add an array of inode
structs in place of the structures you've been talking about altering.

IIUC, your proposals per the "Swap Abstraction LSF_MM 2024.pdf" are to
more than double the per-entry overhead, from 11 B to 24 B. Is that
correct? Of course, if modernizing the structures to be properly folio
aware requires a few bytes, that seems prudent.

Also IIUC, 8 of those 24 bytes are a per-swap-entry pointer to a
dynamically allocated structure that will be used to manage
heterogeneous block size allocation on block devices. I object to
this. That's what the filesystem abstraction is for. EXT4 too heavy
for you? Then make a simpler filesystem.

So how do you map swap entries to a filesystem without a new mapping
layer? Here is a simple proposal. (It assumes there are only 16 valid
folio orders. There are ways to get around that limit, but they would
take longer to explain, so let's just go with it.)

* swap_types (fs inodes) map to different page sizes (page, compound
  order, folio order, mTHP size, etc.), e.g. swap_type == 1 -> 4K
  pages, swap_type == 15 -> 1G hugepages, etc.
* swap_type = fs inode
* swap_offset = fs file offset
* swap_offset is selected using the same simple allocation scheme as
  today.
  - Because the swap entries are all the same size/order per
    swap_type/inode, you can just pick the first free slot.
* On freeing a swap entry, call fallocate(FALLOC_FL_PUNCH_HOLE).
  - This removes the blocks from the file without changing its "size".
  - No changes to the swap_offsets are required to garbage collect
    blocks.

(Sketches of the entry encoding and of the free path follow at the end
of this mail.)

This gets you the following:

* dynamic allocation of block space between sizes/orders
* no new in-memory tracking structures covering all swap entries
* the burden of tracking is placed on the filesystem
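To make the mapping concrete, here is a minimal sketch in plain C.
Every name in it (swap_backing_file, swap_alloc_slot, the bit layout
of the entry) is made up for illustration; nothing here is an existing
kernel symbol. One backing file per folio order; allocation is a
first-free-slot scan over a bitmap, which is sufficient precisely
because all entries in a given file are the same order:

#include <stdint.h>
#include <stdio.h>

#define NR_SWAP_ORDERS 16      /* assumption: 16 valid folio orders */
#define SLOTS_PER_TYPE 4096    /* toy capacity per backing file */

struct swap_backing_file {     /* stands in for one fs inode */
	int order;                         /* folio order stored here */
	uint8_t used[SLOTS_PER_TYPE / 8];  /* slot allocation bitmap */
};

static struct swap_backing_file swap_files[NR_SWAP_ORDERS];

/* Pick the first free slot in the file for this folio order. */
static long swap_alloc_slot(int order)
{
	struct swap_backing_file *sf = &swap_files[order];

	for (long slot = 0; slot < SLOTS_PER_TYPE; slot++) {
		if (!(sf->used[slot / 8] & (1u << (slot % 8)))) {
			sf->used[slot / 8] |= 1u << (slot % 8);
			return slot;       /* becomes swap_offset */
		}
	}
	return -1;                         /* file full */
}

/* swap entry = (swap_type, swap_offset) = (inode, file offset) */
static uint64_t make_swap_entry(int type, long offset)
{
	return ((uint64_t)type << 58) | (uint64_t)offset;
}

int main(void)
{
	long off = swap_alloc_slot(0);     /* order-0 (4K) file */

	printf("entry = %#llx\n",
	       (unsigned long long)make_swap_entry(0, off));
	return 0;
}

Note there is no per-entry pointer anywhere: the entry itself is just
(inode, offset), and everything else lives in the filesystem.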
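And the free path really is just hole punching. Here is a userspace
demonstration of the syscall (the kernel-side equivalent would go
through vfs_fallocate(); the file name and slot geometry are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("swapfile.img", O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Slot geometry for a hypothetical order-0 (4K) swap file. */
	off_t slot_size = 4096, slot = 42;

	/*
	 * "Free" slot 42. FALLOC_FL_PUNCH_HOLE must be OR'ed with
	 * FALLOC_FL_KEEP_SIZE, which is exactly the property we want:
	 * the blocks go back to the filesystem while the file's size,
	 * and therefore every other swap_offset, stays untouched.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      slot * slot_size, slot_size) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}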