> On Jul 19, 2022, at 10:36 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Wed, Jul 20, 2022 at 01:26:13AM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Jul 19, 2022, at 6:44 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote:
>>>> On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
>>>>>>> On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> From: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>>>>>>> 
>>>>>>> Rather than relying on the underlying filesystem to tell us where hole
>>>>>>> and data segments are through vfs_llseek(), let's instead do the hole
>>>>>>> compression ourselves. This has a few advantages over the old
>>>>>>> implementation:
>>>>>>> 
>>>>>>> 1) A single call to the underlying filesystem through nfsd_readv() means
>>>>>>>    the file can't change from underneath us in the middle of encoding.
>>>>> 
>>>>> Hi Anna,
>>>>> 
>>>>> I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start
>>>>> of nfsd4_encode_read_plus_data() that is used to trim the data that
>>>>> has already been read out of the file?
>>>> 
>>>> There is also the vfs_llseek(SEEK_DATA) call at the start of
>>>> nfsd4_encode_read_plus_hole(). They are used to determine the length
>>>> of the current hole or data segment.
>>>> 
>>>>> What's the problem with racing with a hole punch here? All it does
>>>>> is shorten the read data returned to match the new hole, so all it's
>>>>> doing is making the returned data "more correct".
>>>> 
>>>> The problem is we call vfs_llseek() potentially many times when
>>>> encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a
>>>> loop where we alternate between hole and data segments until we've
>>>> encoded the requested number of bytes.
>>>> My attempts at locking the file
>>>> have resulted in a deadlock since vfs_llseek() also locks the file, so
>>>> the file could change from underneath us during each iteration of the
>>>> loop.
>>> 
>>> So the problem being solved is that the current encoding is not
>>> atomic, rather than trying to avoid any computational overhead of
>>> multiple vfs_llseek calls (which are largely just the same extent
>>> lookups as we do during the read call)?
>> 
>> Reviewing [1] and [2] I don't find any remarks about atomicity
>> guarantees. If a client needs an uncontended view of a file's
>> data, it's expected to fence other accessors via an OPEN(deny)
>> or LOCK operation, or serialize the requests itself.
> 
> You've got the wrong "atomicity" scope :)
> 
> What I was talking about is the internal server side data operation
> atomicity. That is, what is returned from the read to the READ_PLUS
> code is not atomic w.r.t. the vfs_llseek() calls that are then used
> to determine where the holes in the data returned by the read are.
> 
> Hence after the read has returned data to READ_PLUS, something else
> can modify the data in the file (e.g. filling a hole or punching a
> new one out) and then the ranges vfs_llseek() returns to READ_PLUS
> do not match the data it has in its local buffer.

Architecturally, with the NFS protocol, the client and the application
running there are responsible for stopping "something else [from]
modifying the data in the file." NFS operations in and of themselves
are not usually atomic in that respect.

A READ operation has the same issue as READ_PLUS: a hole punch can
remove data from a file while the server is constructing and encoding
a READ reply, unless the application on the client has taken the
trouble to block foreign file modifications.

> i.e. to do what the READ_PLUS operation is doing now, it would
> somehow need to ensure no modifications can be made between the read
> starting and the last vfs_llseek() call completing.
> IOWs, they need
> to be performed as an atomic operation, not as a set of
> independently locked (or unlocked!) operations as they are now.

There is nothing making that guarantee on the server, and as I said,
there is nothing I've found in the specs that requires that level of
atomicity from a single READ or READ_PLUS operation. Maybe I don't
understand what you mean by "what the READ_PLUS operation is doing
now"?

>>> The implementation just seems backwards to me - rather than reading
>>> data and then trying to work out where the holes are, I suspect it
>>> should be working out where the holes are and then reading the data.
>>> This is how the IO path in filesystems works, so it would seem like
>>> a no-brainer to try to leverage the infrastructure we already have
>>> to do that.
>>> 
>>> The information is there and we have infrastructure that exposes it
>>> to the IO path, it's just *underneath* the page cache, and the page
>>> cache destroys the information that it used to build the data it
>>> returns to the NFSD.
>>> 
>>> IOWs, it seems to me that what READ_PLUS really wants is a "sparse
>>> read operation" from the filesystem rather than the current "read
>>> that fills holes with zeroes". i.e. a read operation that sets an
>>> iocb flag like RWF_SPARSE_READ to tell the filesystem to trim the
>>> read to just the ranges that contain data.
>>> 
>>> That way the read populates the page cache over a single contiguous
>>> range of data and returns with the {offset, len} that spans the
>>> range that is read and mapped. The caller can then read that region
>>> out of the page cache and mark all the non-data regions as holes in
>>> whatever manner they need to.
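To make sure I follow the shape of that iteration: userspace can
approximate the hole/data walk today with lseek(2) SEEK_DATA/SEEK_HOLE,
which is the moral equivalent of what nfsd4_encode_read_plus() does now
via vfs_llseek(). A rough sketch (Python for brevity; the helper name is
mine, and, per the discussion above, this walk is not atomic against a
concurrent hole punch):

```python
import os

def data_segments(fd, offset, length):
    """Enumerate (offset, length) ranges containing data within
    [offset, offset + length), alternating SEEK_DATA and SEEK_HOLE.
    NOT atomic w.r.t. concurrent writers or hole punching."""
    end = offset + length
    segments = []
    pos = offset
    while pos < end:
        try:
            # Find the start of the next data run at or after pos.
            data = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError:
            break  # ENXIO: no more data before EOF
        if data >= end:
            break
        # Find where that data run ends (the next hole, or EOF).
        hole = os.lseek(fd, data, os.SEEK_HOLE)
        segments.append((data, min(hole, end) - data))
        pos = hole
    return segments
```

Note that on a filesystem without hole tracking the kernel's generic
llseek treats the whole file as one data extent with an implicit hole
at EOF, so this degenerates to a single data segment, which is the
READ-equivalent behaviour.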
>>> 
>>> The iomap infrastructure that XFS and other filesystems use provides
>>> this exact "map only what contains data" capability - an iomap tells
>>> the page cache exactly what underlies the data range (hole, data,
>>> unwritten extents, etc) in an efficient manner, so it wouldn't be a
>>> huge stretch just to limit read IO ranges to those that contain only
>>> DATA extents.
>>> 
>>> At this point READ_PLUS then just needs to iterate doing sparse
>>> reads and recording the ranges that return data as a vector of some
>>> kind that is then passed to the encoding function to encode it as
>>> a sparse READ_PLUS data range....
>> 
>> The iomap approach
> 
> Not actually what I proposed - I'm suggesting a new kiocb flag that
> changes what the read passed to the filesystem does. My comments
> about iomap are that this infrastructure already provides the extent
> range query mechanisms that allow us to efficiently perform such
> "restricted range" IO operations.
> 
>> seems sensible to me and covers the two basic
>> usage scenarios:
>> 
>> - Large sparse files, where we want to conserve both network
>>   bandwidth and client (and intermediate) cache occupancy.
>>   These are best served by exposing data and holes.
> 
> *nod*
> 
>> - Everyday files that are relatively small and generally will
>>   contain few, if any, holes. These are best served by using
>>   a splice read (where possible) -- that is, READ_PLUS in this
>>   case should work exactly like READ.
> 
> *nod*
> 
>> My impression of client implementations is that, a priori,
>> a client does not know whether a file contains holes or not,
>> but will probably always use READ_PLUS and let the server
>> make the choice for it.
> 
> *nod*
> 
>> Now how does the server make that choice? Is there an attribute
>> bit that indicates when a file should be treated as sparse? Can
>> we assume that immutable files (or compressed files) should
>> always be treated as sparse? Alternately, the server might use
>> the file's data : hole ratio.
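For what it's worth, the data : hole ratio idea amounts to something
like the following. This is a purely hypothetical policy sketch with
an arbitrary threshold, not anything implemented in nfsd; the function
name and numbers are mine:

```python
def worth_encoding_holes(hole_bytes, total_bytes, min_hole=64 * 1024):
    """Hypothetical server policy: only return CONTENT_HOLE segments
    when the holes are large enough that skipping them saves more than
    the extra per-segment encoding work. Otherwise encode the whole
    range as a single CONTENT_DATA segment, READ-style. The 64KB
    threshold and 50% ratio are arbitrary illustrations, not measured
    values."""
    if total_bytes == 0:
        return False
    return hole_bytes >= min_hole and hole_bytes / total_bytes >= 0.5
```

With this policy, a 10MB file containing a lone 512-byte hole would be
sent as plain data, while a mostly-empty sparse file would get segment
encoding.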
> 
> None of the above. The NFS server has no business knowing intimate
> details about how the filesystem has laid out the file. All it cares
> about is ranges containing data and ranges that have no data (holes).

That would be the case if we want nothing more than impermeable
software layering. That's nice for software developers, but maybe not
of much benefit to average users.

I see no efficiency benefit, for example, if a 10MB object file has
512 bytes of zeroes at a convenient offset and the server returns that
as DATA/HOLE/DATA. The amount of extra work it has to do to make that
happen results in the same latencies as transmitting 512 extra bytes
on GbE. It might be even worse on faster network fabrics.

On fast networks, the less the server's host CPU has to be involved in
doing the READ, the better it scales. It's better to set up the I/O
and step out of the way; use zero touch as much as possible.

Likewise on the client: it might receive a CONTENT_HOLE, but then its
CPU has to zero out that range, with all the memory and cache activity
that entails. For small holes, that's going to be a lot of memset(0).
If the server returns holes only when they are large, then the client
can use more efficient techniques (like marking page cache pages or
using ZERO_PAGE).

On networks with large bandwidth-latency products, however, it makes
sense to trade as much server and client CPU and memory for
transferred bytes as you can.

The mechanism that handles sparse files needs to be distinct from the
policy of when to return more than a single CONTENT_DATA segment,
since one of our goals is to ensure that READ_PLUS performs and scales
as well as READ on common workloads (ie, not HPC / large sparse file
workloads).

> If the filesystem can support a sparse read, it returns sparse
> data ranges to the NFS server. If the filesystem can't
> support it, or its internal file layout doesn't allow for efficient
> hole/data discrimination, then it can just return the entire read
> range.
> 
> Alternatively, in this latter case, the filesystem could call a
> generic "sparse read" implementation that runs memchr_inv() to find
> the first data range to return. Then the NFS server doesn't have to
> code things differently, filesystems don't need to advertise
> support for sparse reads, etc, because every filesystem could
> support sparse reads.
> 
> The only difference is that some filesystems will be much more
> efficient and faster at it than others. We already see that sort
> of thing with btrfs and seek hole/data on large cached files, so I
> don't see "filesystems perform differently" as a problem here...

The problem with that approach is that it will result in performance
regressions on NFSv4.2 with exports that reside on underperforming
filesystem types. We need READ_PLUS to perform as well as READ so
there is no regression between NFSv4.2 without and with READ_PLUS,
and no regression when migrating from NFSv4.1 to NFSv4.2 with
READ_PLUS enabled.

--
Chuck Lever