The current folio work is focused on permitting the VM to use physically contiguous chunks of memory. Both Darrick and Johannes have pointed out the advantages of supporting logically-contiguous, physically-discontiguous chunks of memory. Johannes wants to be able to use order-0 allocations to allocate larger folios, getting the benefit of managing the memory in larger chunks without requiring the memory allocator to be able to find contiguous chunks. Darrick wants to support non-power-of-two block sizes. Johannes' ask is more readily achievable. It requires a bit in struct page to distinguish between contiguous/discontiguous folios. The biggest change would probably be to folio_page(), which will need a way to find each subpage given the struct folio. There are several different ways to accomplish this (and I don't want to get into a detailed design discussion here). page_folio() actually requires no change at all; compound_head() can move to any struct page. There are challenges with things like adding a folio to a bio_vec. The current code assumes that if we can add some of the folio, we can add all of the folio, and that's not true for discontiguous folios (we might run out of room in the bio_vec and have to allocate a new one). It's a SMOP to fix. There are probably other problems I haven't thought of, but I don't expect them to be terribly hard to fix. Non-power-of-two folios are more awkward. Any calls to folio_order() and folio_shift() have to be guarded by folio_test_contig() (which will be fine). The bigger problem is the radix tree. It really only works for power-of-two sized objects (and honestly, it's not even all that great for things which aren't a power of 64 in size). See appendix for more details. Liam and I (mostly Liam) have been working on the maple tree data structure. It naturally supports arbitrary-sized objects. It does not currently support three important features: 1. It is inefficient for objects of size 1. The leaf node currently stores 15 indices and 16 pointers. By adding a new leaf node type, that can be increased to 31 pointers, effectively making it twice as dense. 2. It does not support search marks. The page cache relies on being able to efficiently search for pages which are marked as dirty or writeback. Again, a new node type can fix this by sacrificing one pointer per level of the tree in order to mark subnodes/objects. 3. It does not support the 'private_list' / xa_update_node_t, which is used by the workingset code to prune shadow entries under memory pressure. I think the maple tree should take a different approach from the radix tree. Instead of pruning nodes which contain only shadow entries, we should prune shadow entries from the tree which will cause the tree to shrink. That calls for marking inodes as containing shadow entries, and removing old shadow entries, maybe on an LRU of some kind. I haven't thought deeply about this one, but I'm convinced that (as with so many things) a real tree behaves better than the radix tree. The maple tree is also better at storing "random" power of two sizes. In the worst case, an order-5 page will take up 32 pointers (256 bytes; 4 cache lines) in the tree. Even an order-2 page (the most common allocation) takes 4 pointers. That kind of wasted space offers considerable opportunity for the maple tree to do better than the radix tree for real workloads. If nobody beats me to it, I expect to spend some time on the maple tree soon. This is a good time to offer suggestions, as opposed to waiting until the pull request to tell me the entire idea is wrong. Appendix: The fsync_range problem The radix tree works by indexing into a 64-element table 6 bits at a time. Assuming a 64-bit lookup index, it uses the top 4 bits to choose a node from the top-level array, then it uses bits 54-59 to choose the next node, then bits 48-53 for the next node, and so on. This is optimised to start further down the tree for trees which have no entries above a certain index, saving memory and lookup time. When we mark a page as dirty, we set a search mark in the node that contains that page (and recursively all its parents, up to the root node), ensuring that when we walk the tree looking for dirty pages, we will find it. That search is done by vfs_fsync_range() (there are a number of syscalls which will get us there, eg sync_file_range(), msync(), IORING_OP_FSYNC) It calls the fs-specific ->fsync() method, which usually calls file_write_and_wait_range(), calls __filemap_fdatawrite_range() calls do_writepages() calls ->writepages() which usually calls write_cache_pages() calls pagevec_lookup_range_tag() calls find_get_pages_range_tag() calls find_get_entry(). Here's the problem. We might have a 20kB folio at index 4095-4099 of a file. When we dirty the page at index 4097, we mark the head page as dirty and so we mark index 4095 as dirty (which marks the node containing 4032-4095 as dirty, and its parent node 0-4095 as dirty, and its parent node 0-262143 as dirty). Then we msync() the page at index 4097, so we start a search looking for dirty pages from index 4097-4097. Node 0-262143 is dirty, so we step down into it, but then node 4096-8193 is not dirty, so we don't even look in it. This isn't a problem for power-of-two folios. All subpages are in the same node as the head page. When the XArray does a mark search, it actually loads the entry, even if it's not marked. If that entry is a sibling entry, it checks to see whether the canonical entry is marked or not. We can't solve this by having the filesystem "round up" its search to the nearest multiple of 20kB. It would fix this particular example, but if we happen to have allocated a 40kB page at index 8190, dirtied-and-then-searched-for index 8207, we'd be looking for a start of 8200 and miss the dirty mark of index 8190. We could solve this by marking every subpage as being dirty, but that makes marking rather inefficient (we could end up having to mark twice as many nodes as we currently do) and complicated. I think it's better to admit we've reached the limit of what a radix tree can do for us and move to the maple tree.