Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation

Josef Bacik <josef@xxxxxxxxxxxxxx> · Thu, 21 Jul 2022 16:47:55 -0400

On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote:
> On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
> > > > On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
> > > >
> > > > From: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
> > > >
> > > > Rather than relying on the underlying filesystem to tell us where hole
> > > > and data segments are through vfs_llseek(), let's instead do the hole
> > > > compression ourselves. This has a few advantages over the old
> > > > implementation:
> > > >
> > > > 1) A single call to the underlying filesystem through nfsd_readv() means
> > > >   the file can't change from underneath us in the middle of encoding.
> >
> > Hi Anna,
> >
> > I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start
> > of nfsd4_encode_read_plus_data() that is used to trim the data that
> > has already been read out of the file?
> 
> There is also the vfs_llseek(SEEK_DATA) call at the start of
> nfsd4_encode_read_plus_hole(). They are used to determine the length
> of the current hole or data segment.
> 
> >
> > What's the problem with racing with a hole punch here? All it does
> > is shorten the read data returned to match the new hole, so all it's
> > doing is making the returned data "more correct".
> 
> The problem is we call vfs_llseek() potentially many times when
> encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a
> loop where we alternate between hole and data segments until we've
> encoded the requested number of bytes. My attempts at locking the file
> have resulted in a deadlock since vfs_llseek() also locks the file, so
> the file could change from underneath us during each iteration of the
> loop.
> 
> >
> > OTOH, if something allocates over a hole that the read filled with
> > zeros, what's the problem with occasionally returning zeros as data?
> > Regardless, if this has raced with a write to the file that filled
> > that hole, we're already returning stale data/hole information to
> > the client regardless of whether we trim it or not....
> >
> > i.e. I can't see a correctness or data integrity problem here that
> > doesn't already exist, and I have my doubts that hole
> > punching/filling racing with reads happens often enough to create a
> > performance or bandwidth problem OTW. Hence I've really got no idea
> > what the problem that needs to be solved here is.
> >
> > Can you explain what the symptoms of the problem a user would see
> > that this change solves?
> 
> This fixes xfstests generic/091 and generic/263, along with this
> reported bug: https://bugzilla.kernel.org/show_bug.cgi?id=215673
> >
> > > > 2) A single call to the underlying filestem also means that the
> > > >   underlying filesystem only needs to synchronize cached and on-disk
> > > >   data one time instead of potentially many speeding up the reply.
> >
> > SEEK_HOLE/DATA doesn't require cached data to be sync'd to disk to
> > be coherent - that's only a problem FIEMAP has (and syncing cached
> > data doesn't fix the TOCTOU coherency issue!).  i.e. SEEK_HOLE/DATA
> > will check the page cache for data if appropriate (e.g. unwritten
> > disk extents may have data in memory over the top of them) instead
> > of syncing data to disk.
> 
> For some reason, btrfs on virtual hardware has terrible performance
> numbers when using vfs_llseek() with files that are already in the
> server's cache. I think it had something to do with how they lock
> extents, and some extra work that needs to be done if the file already
> exists in the server's memory but it's been  a few years since I've
> gone into their code to figure out where the slowdown is coming from.
> See this section of my performance results wiki page:
> https://wiki.linux-nfs.org/wiki/index.php/Read_Plus_May_2022#BTRFS_3
>

I just did this locally, once in my test vm's and once on my continuous
performance testing hardware and I'm not able to reproduce the numbers, so I
think I'm doing something wrong.

My test is stupid, I just dd a 5gib file, come behind it and punch holes every
other 4k.  Then I umount and remount, SEEK_DATA+SEEK_HOLE through the whole
file, and then do it again so I have uncached and cached.  The numbers I'm
seeing are equivalent to ext4/xfs.  Granted on my VM I had to redo the test
because I had lockdep and other debugging on which makes us look stupid because
of the extent locking stuff, but with it off everything appears normal.

Does this more or less mirror your testing?  Looking at our code we're not doing
anything inherently stupid, so I'm not entirely sure what could be the problem.
Thanks,

Josef