On Fri, Jul 22, 2022 at 08:45:13AM -0400, Anna Schumaker wrote: > On Thu, Jul 21, 2022 at 4:48 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > > > > On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote: > > > On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > > > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote: > > > > > > On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote: > > > > > > > > > > > > From: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx> > > > > > > > > > > > > Rather than relying on the underlying filesystem to tell us where hole > > > > > > and data segments are through vfs_llseek(), let's instead do the hole > > > > > > compression ourselves. This has a few advantages over the old > > > > > > implementation: > > > > > > > > > > > > 1) A single call to the underlying filesystem through nfsd_readv() means > > > > > > the file can't change from underneath us in the middle of encoding. > > > > > > > > Hi Anna, > > > > > > > > I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start > > > > of nfsd4_encode_read_plus_data() that is used to trim the data that > > > > has already been read out of the file? > > > > > > There is also the vfs_llseek(SEEK_DATA) call at the start of > > > nfsd4_encode_read_plus_hole(). They are used to determine the length > > > of the current hole or data segment. > > > > > > > > > > > What's the problem with racing with a hole punch here? All it does > > > > is shorten the read data returned to match the new hole, so all it's > > > > doing is making the returned data "more correct". > > > > > > The problem is we call vfs_llseek() potentially many times when > > > encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a > > > loop where we alternate between hole and data segments until we've > > > encoded the requested number of bytes. My attempts at locking the file > > > have resulted in a deadlock since vfs_llseek() also locks the file, so > > > the file could change from underneath us during each iteration of the > > > loop. > > > > > > > > > > > OTOH, if something allocates over a hole that the read filled with > > > > zeros, what's the problem with occasionally returning zeros as data? > > > > Regardless, if this has raced with a write to the file that filled > > > > that hole, we're already returning stale data/hole information to > > > > the client regardless of whether we trim it or not.... > > > > > > > > i.e. I can't see a correctness or data integrity problem here that > > > > doesn't already exist, and I have my doubts that hole > > > > punching/filling racing with reads happens often enough to create a > > > > performance or bandwidth problem OTW. Hence I've really got no idea > > > > what the problem that needs to be solved here is. > > > > > > > > Can you explain what the symptoms of the problem a user would see > > > > that this change solves? > > > > > > This fixes xfstests generic/091 and generic/263, along with this > > > reported bug: https://bugzilla.kernel.org/show_bug.cgi?id=215673 > > > > > > > > > > 2) A single call to the underlying filestem also means that the > > > > > > underlying filesystem only needs to synchronize cached and on-disk > > > > > > data one time instead of potentially many speeding up the reply. > > > > > > > > SEEK_HOLE/DATA doesn't require cached data to be sync'd to disk to > > > > be coherent - that's only a problem FIEMAP has (and syncing cached > > > > data doesn't fix the TOCTOU coherency issue!). i.e. SEEK_HOLE/DATA > > > > will check the page cache for data if appropriate (e.g. unwritten > > > > disk extents may have data in memory over the top of them) instead > > > > of syncing data to disk. > > > > > > For some reason, btrfs on virtual hardware has terrible performance > > > numbers when using vfs_llseek() with files that are already in the > > > server's cache. I think it had something to do with how they lock > > > extents, and some extra work that needs to be done if the file already > > > exists in the server's memory but it's been a few years since I've > > > gone into their code to figure out where the slowdown is coming from. > > > See this section of my performance results wiki page: > > > https://wiki.linux-nfs.org/wiki/index.php/Read_Plus_May_2022#BTRFS_3 > > > > > > > I just did this locally, once in my test vm's and once on my continuous > > performance testing hardware and I'm not able to reproduce the numbers, so I > > think I'm doing something wrong. > > > > My test is stupid, I just dd a 5gib file, come behind it and punch holes every > > other 4k. Then I umount and remount, SEEK_DATA+SEEK_HOLE through the whole > > file, and then do it again so I have uncached and cached. The numbers I'm > > seeing are equivalent to ext4/xfs. Granted on my VM I had to redo the test > > because I had lockdep and other debugging on which makes us look stupid because > > of the extent locking stuff, but with it off everything appears normal. > > > > Does this more or less mirror your testing? Looking at our code we're not doing > > anything inherently stupid, so I'm not entirely sure what could be the problem. > > Thanks, > > I think that's pretty close to what the server is doing with the > current code, except the NFS server would also do a read for every > data segment it found. I've been using `vmtouch` to make sure the file > doesn't get evicted from the server's page cache for my cached data. > I messed with it some more and I can't get it to be slow. If you trip over something like this again just give a shout and I'd be happy to dig in. Thanks, Josef