On Fri, Jan 19, 2024 at 09:19:38AM +1100, Dave Chinner wrote:
> The XFS buffer cache supports metadata buffers up to 64kB, and it does so
> by aggregating multiple pages into a single contiguous memory region using
> vmapping. This is expensive (both the setup and the runtime TLB mapping
> cost), and would be unnecessary if we could allocate large contiguous
> memory regions for the buffers in the first place.
>
> Enter multi-page folios.

LOL, hch and I just wrapped up making the xfbtree buffer cache work with
large folios coming from tmpfs.  Though the use case there is simpler
because we require blocksize==PAGE_SIZE, forbid the use of highmem, and
don't need discontig buffers.  Hence we sidestep vm_map_ram. :)

> This patchset converts the buffer cache to use the folio API, then
> enhances it to optimistically use large folios where possible. It retains
> the old "vmap an array of single page folios" functionality as a fallback
> when large folio allocation fails. This means that, like page cache
> support for large folios, we aren't dependent on large folio allocation
> succeeding all the time.
>
> This relegates the single page array allocation mechanism to the "slow
> path", so we don't have to care so much about the performance of this
> path anymore. This might allow us to simplify it a bit in future.
>
> One of the issues with the folio conversion is that we use a couple of
> APIs that take struct page ** (i.e. pointers to page pointer arrays) and
> there aren't folio counterparts. These are the bulk page allocator and
> vm_map_ram(). In the cases where they are used, we cast &bp->b_folios[]
> to (struct page **) knowing that this array will only contain single page
> folios and that single page folios and struct page are the same structure
> and so have the same address. This is a bit of a hack (hence the RFC) but
> I'm not sure that it's worth adding folio versions of these interfaces
> right now. We don't need to use the bulk page allocator so much any more,
> because that's now a slow path and we could probably just call
> folio_alloc() in a loop like we used to. What to do about vm_map_ram()
> is a little less clear....

Yeah, that's what I suspected.
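[For illustration only, a minimal sketch of the cast being discussed, not
the patchset's actual code: it assumes a loop of order-0 folio_alloc()
calls feeding vm_map_ram(), and that a single-page folio shares an address
with its struct page. The helper name and calling convention are invented.]

/*
 * Sketch: vmap an array of order-0 folios by casting the folio array
 * to a page array, relying on a single-page folio and its struct page
 * having the same address.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *vmap_single_page_folios(struct folio **folios,
				     unsigned int count)
{
	unsigned int i;

	/* Slow path: allocate order-0 folios one at a time. */
	for (i = 0; i < count; i++) {
		folios[i] = folio_alloc(GFP_KERNEL, 0);
		if (!folios[i])
			goto out_put;
	}

	/*
	 * vm_map_ram() has no folio variant; an order-0 folio overlays
	 * struct page exactly, so the cast is safe for this array.
	 */
	return vm_map_ram((struct page **)folios, count, NUMA_NO_NODE);

out_put:
	while (i--)
		folio_put(folios[i]);
	return NULL;
}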
> The other issue I tripped over in doing this conversion is that the
> discontiguous buffer straddling code in the buf log item dirty region
> tracking is broken. We don't actually exercise that code on existing
> configurations, and I tripped over it when tracking down a bug in the
> folio conversion. I fixed it and short-circuited the check for contiguous
> buffers, but that didn't fix the failure I was seeing (which was not
> handling bp->b_offset and large folios properly when building bios).

Yikes.

> Apart from those issues, the conversion and enhancement is relatively
> straightforward. It passes fstests on both 512 and 4096 byte sector size
> storage (512 byte sectors exercise the XBF_KMEM path which has non-zero
> bp->b_offset values) and doesn't appear to cause any problems with large
> directory buffers, though I haven't done any real testing on those yet.
> Large folio allocations are definitely being exercised, though, as all
> the inode cluster buffers are 16kB on a 512 byte inode V5 filesystem.
>
> Thoughts, comments, etc?

Not yet.

> Note: this patchset is on top of the NOFS removal patchset I sent a
> few days ago. That can be pulled from this git branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup

Oooh a branch link, thank you.  It's so much easier if I can pull a branch
while picking through commits over gitweb.

--D