The XFS buffer cache supports metadata buffers up to 64kB, and it does so by aggregating multiple pages into a single contiguous memory region using vmapping. This is expensive (both the setup and the runtime TLB mapping cost), and would be unnecessary if we could allocate large contiguous memory regions for the buffers in the first place.

Enter multi-page folios. This patchset converts the buffer cache to use the folio API, then enhances it to optimistically use large folios where possible. It retains the old "vmap an array of single page folios" functionality as a fallback when large folio allocation fails. This means that, like page cache support for large folios, we aren't dependent on large folio allocation succeeding all the time. This relegates the single page array allocation mechanism to a slow path whose performance we no longer need to care much about, which might allow us to simplify it a bit in future.

One of the issues with the folio conversion is that we use a couple of APIs that take struct page ** (i.e. pointers to page pointer arrays) and there aren't folio counterparts. These are the bulk page allocator and vm_map_ram(). Where they are used, we cast &bp->b_folios[] to (struct page **), knowing that the array will only contain single page folios, and that a single page folio and a struct page are the same structure and so have the same address. This is a bit of a hack, so I've ported Christoph's vmalloc()-only fallback patchset on top of these folio changes to remove both the bulk page allocator and the calls to vm_map_ram(). This greatly simplifies the allocation and freeing fallback paths, so it's a win all around.

The other issue I tripped over in doing this conversion is that the discontiguous buffer straddling code in the buf log item dirty region tracking is broken. We don't actually exercise that code on existing configurations, and I tripped over it when tracking down a bug in the folio conversion.
I fixed it and short-circuited the check for contiguous buffers, but that left the code in place and unexecuted. However, Christoph's unmapped buffer removal patch gets rid of unmapped buffers, so we never straddle pages in buffers anymore and that code goes away entirely by the end of the patch set. More wins!

Apart from those small complexities, which are resolved by the end of the patchset, the conversion and enhancement is relatively straightforward. It passes fstests on both 512 and 4096 byte sector size storage (512 byte sectors exercise the XBF_KMEM path, which has non-zero bp->b_offset values) and doesn't appear to cause any problems with large 64kB directory buffers on 4kB page machines.

Version 2:
- use get_order() instead of open coding
- append Christoph's unmapped buffer removal
- rework Christoph's vmalloc-instead-of-vm_map_ram change to apply to the large folio based code. This greatly simplifies everything.

Version 1: https://lore.kernel.org/linux-xfs/20240118222216.4131379-1-david@xxxxxxxxxxxxx/