On Sat, 2023-03-04 at 07:34 +0000, Matthew Wilcox wrote:
> On Fri, Mar 03, 2023 at 08:11:47AM -0500, James Bottomley wrote:
> > On Fri, 2023-03-03 at 03:49 +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > > > That said, I was hoping you were going to suggest supporting
> > > > 16k logical block sizes. Not a problem on some arch's, but
> > > > still problematic when PAGE_SIZE is 4k. :)
> > >
> > > I was hoping Luis was going to propose a session on LBA size >
> > > PAGE_SIZE. Funnily, while the pressure is coming from the storage
> > > vendors, I don't think there's any work to be done in the storage
> > > layers. It's purely a FS+MM problem.
> >
> > Heh, I can do the fools rush in bit, especially if what we're
> > interested in the minimum it would take to support this ...
> >
> > The FS problem could be solved simply by saying FS block size must
> > equal device block size, then it becomes purely a MM issue.
>
> Spoken like somebody who's never converted a filesystem to
> supporting large folios. There are a number of issues:
>
> 1. The obvious; use of PAGE_SIZE and/or PAGE_SHIFT

Well, yes, a filesystem has to be aware it's using a block size larger
than page size.

> 2. Use of kmap-family to access, eg directories. You can't kmap
> an entire folio, only one page at a time. And if a dentry is
> split across a page boundary ...

Is kmap relevant? It's only used for reading user pages in the kernel
and I can't see why a filesystem would use it unless it wants to pack
inodes into pages that also contain user data, which is an optimization
not a fundamental issue (although I grant that as the blocksize grows
it becomes more useful) so it doesn't have to be part of the minimum
viable prototype.

> 3. buffer_heads do not currently support large folios. Working on
> it.

Yes, I always forget filesystems still use the buffer cache.
But fundamentally the buffer_head structure can cope with buffers that
span pages so most of the logic changes would be around
grow_dev_page(). It seems somewhat messy but not too hard.

> Probably a few other things I forget. But look through the recent
> patches to AFS, CIFS, NFS, XFS, iomap that do folio conversions.
> A lot of it is pretty mechanical, but some of it takes hard thought.
> And if you have ideas about how to handle ext2 directories, I'm all
> ears.

OK, so I can see you were waiting for someone to touch a nerve, but if
I can go back to the stated goal, I never really thought *every*
filesystem would be suitable for block size > page size, so simply
getting a few of the modern ones working would be good enough for the
minimum viable prototype.

>
> > The MM issue could be solved by adding a page order attribute to
> > struct address_space and insisting that pagecache/filemap functions
> > in mm/filemap.c all have to operate on objects that are an integer
> > multiple of the address space order. The base allocator is
> > filemap_alloc_folio, which already has an apparently always zero
> > order parameter (hmmm...) and it always seems to be called from
> > sites that have the address_space, so it could simply be modified
> > to always operate at the address_space order.
>
> Oh, I have a patch for that. That's the easy part. The hard part is
> plugging your ears to the screams of the MM people who are convinced
> that fragmentation will make it impossible to mount your filesystem.

Right, so if the MM issue is solved it's picking a first FS for
conversion and solving the buffer problem.

I fully understand that eventually we'll need to get a single large
buffer to span discontiguous pages ... I noted that in the bit you cut,
but I don't see why the prototype shouldn't start with contiguous
pages.

James
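
To make the "address space order" idea quoted above concrete, here is a
rough userspace model of the index arithmetic it implies. It is only a
sketch: it assumes 4k pages, and every name in it (min_order_for_block_size,
align_index, folio_range) is hypothetical and not a kernel interface. It
says nothing about how the allocator or filemap code would actually be
changed.

```python
# Userspace model only; none of these helpers exist in the kernel.
PAGE_SHIFT = 12              # assumption: 4k pages
PAGE_SIZE = 1 << PAGE_SHIFT


def min_order_for_block_size(block_size: int) -> int:
    """Smallest page order whose size covers one block."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two")
    if block_size <= PAGE_SIZE:
        return 0
    # log2(block_size) - PAGE_SHIFT
    return block_size.bit_length() - 1 - PAGE_SHIFT


def align_index(index: int, order: int) -> int:
    """Round a page cache index down to a multiple of 1 << order."""
    return index & ~((1 << order) - 1)


def folio_range(pos: int, order: int) -> tuple[int, int]:
    """First page index and page count of the unit covering byte offset pos."""
    first = align_index(pos >> PAGE_SHIFT, order)
    return first, 1 << order
```

With a 16k block size on 4k pages the order comes out as 2, so every
page cache object in that mapping would start at an index that is a
multiple of 4 and cover 4 pages.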