On Thu, 2022-08-25 at 16:01 +0100, Matthew Wilcox wrote:
> On Wed, Aug 24, 2022 at 05:43:36PM +0000, Trond Myklebust wrote:
> > On Wed, 2022-08-24 at 17:53 +0100, Matthew Wilcox wrote:
> > > On Wed, Aug 24, 2022 at 04:27:04PM +0000, Trond Myklebust wrote:
> > > > Right now, I see limited value in adding multipage folios to NFS.
> > > >
> > > > While basic NFSv4 does allow you to pretend there is a fundamental
> > > > underlying block size, pNFS has changed all that, and we have had
> > > > to engineer support for determining the I/O block size on the fly,
> > > > and building the RPC requests accordingly. Client side mirroring
> > > > just adds to the fun.
> > > >
> > > > As I see it, the only value that multipage folios might bring to
> > > > NFS would be smaller page cache management overhead when dealing
> > > > with large files.
> > >
> > > Yes, but that's a Really Big Deal. Machines with a lot of memory end
> > > up with very long LRU lists. We can't afford the overhead of managing
> > > memory in 4kB chunks any more. (I don't want to dwell on this point
> > > too much; I've run the numbers before and can do so again if you want
> > > me to go into more details).
> > >
> > > Beyond that, filesystems have a lot of interactions with the page
> > > cache today. When I started looking at this, I thought filesystem
> > > people all had a deep understanding of how the page cache worked.
> > > Now I realise everyone's as clueless as I am. The real benefit I see
> > > to projects like iomap/netfs is that they insulate filesystems from
> > > having to deal with the page cache. All the interactions are in two
> > > or three places and we can refactor without having to talk to the
> > > owners of 50+ filesystems.
> > >
> > > It also gives us a chance to re-examine some of the assumptions that
> > > we have made over the years about how filesystems and page cache
> > > should be interacting. We've fixed a fair few bugs in recent years
> > > that came about because filesystem people don't tend to have deep
> > > knowledge of mm internals (and they shouldn't need to!)
> > >
> > > I don't know that netfs has the perfect interface to be used for nfs.
> > > But that too can be changed to make it work better for your needs.
> >
> > If the VM folks need it, then adding support for multi-page folios is
> > a much smaller scope than what David was describing. It can be done
> > without too much surgery to the existing NFS I/O stack. We already
> > have code to support I/O block sizes that are much less than the page
> > size, so converting that to act on larger folios is not a huge deal.
> >
> > What would be useful there is something like a range tree to allow us
> > to move beyond the PG_uptodate bit, and help make the
> > is_partially_uptodate() address_space_operation a bit more useful.
> > Otherwise, we end up having to read in the entire folio, which is what
> > we do today for pages, but could get onerous with large folios when
> > doing file random access.
>
> This is interesting because nobody's asked for this before. I've had
> similar discussions around dirty data tracking, but not around
> uptodate.
> Random small reads shouldn't be a terrible problem; if they truly are
> random, we behave as today, allocating single pages, reading the
> entire page from the server and setting it uptodate. If the readahead
> code detects a contiguous large read, we increase the allocation size
> to match, but again we always read the entire folio from the server
> and mark it uptodate.
>
> As far as I know, the only time we create !uptodate folios in the page
> cache is partial writes to a folio which has not been previously read.
> Obviously, those bytes start out dirty and are tracked through the
> existing dirty mechanism, but once they've been written back, we have
> three choices that I can see:
>
> 1. transition those bytes to a mechanism which records they're uptodate
> 2. discard that information and re-read the entire folio from the
>    server if any bytes are subsequently read
> 3. read the other bytes in that folio from the server and mark the
>    entire folio uptodate
>
> We have a mixture of those options implemented in different filesystems
> today. iomap records whether a block is uptodate or not and treats
> every uptodate block as dirty if any block in the folio is dirty.
> buffer_head has two bits for each block, separately recording whether
> it's dirty and/or uptodate. AFS tracks one dirty range per folio, but
> it first brings the folio uptodate by reading it from the server before
> overwriting it (I suppose that's a fourth option).

I'm not talking about the transition of dirty->clean. We already deal
with that. I'm talking about supporting large folios on read-mostly
workloads.

NFS can happily support 1MB sized folios, or even larger than that if
there is a compelling reason to do so. However, having to read in the
entire folio contents when the user is just asking for a few bytes on a
database-style random read workload can quickly get onerous. While a
lot of NFS servers can do 1MB reads in one RPC call, there are still
many out there that can't. For those servers, we'd have to fall back to
issuing multiple read RPC calls in parallel (which is what we do today
if the user sets an rsize < PAGE_SIZE). This leads to unnecessary load
on the server, which has to deal with multiple RPC calls for data that
won't be used.

The other point is that if your network bandwidth is limited, there is
value in avoiding reads for data that isn't going to be used, which is
why we changed the NFS readahead behaviour to be less aggressive than
it used to be.

This is why I'm suggesting that if you really want to cut down the LRU
list size, you'll want finer-grained uptodate tracking than the folio.
It's not so much for the case of writes as it is for the read-mostly
workloads.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
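
As a rough illustration of the sub-folio uptodate tracking being discussed
above, here is a minimal userspace sketch of a per-block uptodate bitmap
that can answer, for a byte range, the question ->is_partially_uptodate()
asks. It is only a model, not kernel code: the names (folio_state,
mark_range_uptodate, range_is_uptodate) and the 1MB folio / 4KB block
sizes are hypothetical, and a range tree as suggested above could serve
the same purpose without a fixed block granularity.

/*
 * Minimal userspace model of per-block uptodate tracking inside one
 * large folio.  All names and sizes are hypothetical, chosen only to
 * illustrate answering "is this byte range uptodate?" without relying
 * on a single per-folio flag.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define FOLIO_SIZE      (1024 * 1024)   /* pretend 1MB folio */
#define BLOCK_SIZE      4096            /* tracking granularity */
#define NR_BLOCKS       (FOLIO_SIZE / BLOCK_SIZE)

struct folio_state {
        /* one bit per block, set once that block's contents are valid */
        unsigned long uptodate[(NR_BLOCKS + 63) / 64];
};

static void set_block_uptodate(struct folio_state *fs, unsigned int nr)
{
        fs->uptodate[nr / 64] |= 1UL << (nr % 64);
}

static bool block_is_uptodate(const struct folio_state *fs, unsigned int nr)
{
        return fs->uptodate[nr / 64] & (1UL << (nr % 64));
}

/* Record a completed read of [pos, pos + len); assumes block-aligned I/O. */
static void mark_range_uptodate(struct folio_state *fs, size_t pos, size_t len)
{
        for (unsigned int i = pos / BLOCK_SIZE; i <= (pos + len - 1) / BLOCK_SIZE; i++)
                set_block_uptodate(fs, i);
}

/*
 * The question ->is_partially_uptodate() answers: can [pos, pos + len)
 * be copied out without another read from the server?
 */
static bool range_is_uptodate(const struct folio_state *fs, size_t pos, size_t len)
{
        for (unsigned int i = pos / BLOCK_SIZE; i <= (pos + len - 1) / BLOCK_SIZE; i++)
                if (!block_is_uptodate(fs, i))
                        return false;
        return true;
}

int main(void)
{
        struct folio_state fs;

        memset(&fs, 0, sizeof(fs));

        /* a single 16KB read RPC completed at offset 256KB of the folio */
        mark_range_uptodate(&fs, 256 * 1024, 16 * 1024);

        /* a 2KB read inside that range needs no further RPC... */
        printf("260KB+2KB uptodate: %d\n", range_is_uptodate(&fs, 260 * 1024, 2048));
        /* ...but a read elsewhere in the folio still does */
        printf("0KB+8KB uptodate:   %d\n", range_is_uptodate(&fs, 0, 8192));
        return 0;
}

In the real page cache this state would hang off the folio (roughly the
way iomap attaches per-block state today), so only the blocks a random
reader actually touches would ever need to be fetched from the server.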