Re: [PATCH v3] nfsd: disallow file locking and delegations for NFSv4 reexport

On Mon, 18 Nov 2024 at 18:57, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On Thu, Oct 31, 2024 at 11:14:51AM -0400, Chuck Lever wrote:
> > On Wed, Oct 23, 2024 at 11:58:46AM -0400, Mike Snitzer wrote:
> > > We do not and cannot support file locking with NFS reexport over
> > > NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport
>
>  [ ... patch snipped ... ]
>
> > > diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
> > > index ff9ae4a46530..044be965d75e 100644
> > > --- a/Documentation/filesystems/nfs/reexport.rst
> > > +++ b/Documentation/filesystems/nfs/reexport.rst
> > > @@ -26,9 +26,13 @@ Reboot recovery
> > >  ---------------
> > >
> > >  The NFS protocol's normal reboot recovery mechanisms don't work for the
> > > -case when the reexport server reboots.  Clients will lose any locks
> > > -they held before the reboot, and further IO will result in errors.
> > > -Closing and reopening files should clear the errors.
> > > +case when the reexport server reboots because the source server has not
> > > +rebooted, and so it is not in grace.  Since the source server is not in
> > > +grace, it cannot offer any guarantees that the file won't have been
> > > +changed between the locks getting lost and any attempt to recover them.
> > > +The same applies to delegations and any associated locks.  Clients are
> > > +not allowed to get file locks or delegations from a reexport server, any
> > > +attempts will fail with operation not supported.
> > >
> > >  Filehandle limits
> > >  -----------------
>
> Note for Mike:
>
> Last sentence "Clients are not allowed to get ... delegations from a
> reexport server" -- IIUC it's up to the re-export server to not hand
> out delegations to its clients. Still, it's important to note that
> NFSv4 delegation would not be available for re-exports.
>
> See below for more: I'd like this paragraph to continue to discuss
> the issue of OPEN and I/O behavior when the re-export server
> restarts. The patch seems to redact that bit of detail.
>
> Following is general discussion:
>
>
> > There seems to be some controversy about this approach.
> >
> > Also I think it would be nicer all around if we followed the usual
> > process for changes that introduce possible behavior regressions:
> >
> >  - add the new behavior, make it optional, default old behavior
> >  - wait a few releases
> >  - change the default to new behavior
> >
> > Lastly, there haven't been any user complaints about the current
> > situation of no lock recovery in the re-export case.
> >
> > Jeff and I discussed this, and we plan to drop this one for 6.13 but
> > let the conversation continue. Mike, no action needed on your part
> > for the moment, but please stay tuned!
> >
> > IMO having an export option (along the lines of "async/sync") that
> > is documented in a man page is going to be a better plan. But if we
> > find a way to deal with this situation without a new administrative
> > control, that would be even better.
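
To make Chuck's suggestion concrete: I imagine an explicit opt-out in
/etc/exports, something like the line below (the "reexport_locks="
option name is entirely made up, nothing like it exists today):

  /srv/reexport  *(rw,no_subtree_check,reexport_locks=none)

i.e. a documented knob in the "async"/"sync" mould rather than a
silent behaviour change.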
>
> Proposed solutions so far:
>
> - Disable NFS locking entirely on NFS re-export
>
> - Implement full state pass-through for re-export
>
> Some history of the NFSD design and the re-export issue is provided
> here:
>
>   http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export#reboot_recovery
>
> Certain usage scenarios require that lock state be globally visible,
> so disabling NFS locking on re-export mounts will need to be
> considered carefully.
>
> Assuming that NFSv4 LOCK operations are proliferated to the back-end
> server in today's NFSD, does it make sense to avoid code changes at
> the moment, but more carefully document the configuration options
> and their risks?
>
> +++ In all following configurations, no state recovery occurs when
> the re-export server restarts, as explained in
> Documentation/filesystems/nfs/reexport.rst.
>
> Mount options on the re-export server and clients:
>
> * All default: open and lock state is proliferated to the back-end
>   server and is visible to all NFS clients.
>
> * local_lock=all on the re-export server's mount of the back-end
>   server: clients of that server all see the same set of locks, but
>   these locks are not visible to the back-end server or any of its
>   clients. Open state is visible everywhere.
>
> * local_lock=all on the clients' mounts of the re-export server:
>   applications on NFS clients do not see locks set by applications
>   on any other NFS client. Open state is visible everywhere.
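
FWIW, roughly what those two local_lock placements look like as
mount commands (hostnames and paths made up, and I'd use vers=3
since that is what we run; see nfs(5) before copying):

  # On the re-export server, mounting the back-end server: locks
  # from all re-export clients are satisfied here and never reach
  # the back-end.
  mount -t nfs -o vers=3,local_lock=all backend:/export /srv/reexport

  # On a client of the re-export server: locks are satisfied on the
  # client itself and never reach the re-export server at all.
  mount -t nfs -o vers=3,local_lock=all reexport:/srv/reexport /mnt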
>
> When an NFS client of the re-export server OPENs a file, currently
> that creates OPEN state on the re-export server, and I assume also
> on the back-end server. That state cannot be recovered if the
> re-export server restarts, but it also cannot be blocked by a mount
> option.
>
> Likewise, I assume the back-end server can hand out delegations to
> the re-export server. If the re-export server restarts, how does it
> recover those delegations? The re-export server could disable
> delegation by blocking off its callback service, but should it?
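
On "blocking off its callback service": with NFSv4.0 the callback
channel is a separate TCP connection from the server back to the
client, so one crude (untested, illustrative only) way would be to
pin the callback port with the nfs module's callback_tcpport
parameter and firewall it on the re-export server:

  # Pin the NFSv4.0 callback listener to a known port...
  modprobe nfs callback_tcpport=32764
  # ...and refuse callback connections from the back-end server,
  # so it stops offering us delegations:
  iptables -A INPUT -p tcp -s backend --dport 32764 -j DROP

That trick doesn't carry over to NFSv4.1+, where callbacks ride the
backchannel of the existing connection, and whether the re-export
server *should* do this is exactly the open question.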
>
> What, if anything, is being done to further develop and regularly
> test NFS re-export in upstream kernels?
>
> The reexport.rst file: This still reads more like design notes than
> administrative documentation.  IMHO it should instead have a more
> detailed description and disclaimer regarding what kind of manual
> recovery is needed after a re-export server restart. That seems like
> important information for administrators who think they might want
> to deploy this solution. Maybe Documentation/ isn't the right place
> for administrative documentation?
>
> It might be prudent to (temporarily) label NFS re-export as
> experimental use only, given its incompleteness and the long list
> of caveats.

As someone who uses NFSv3 re-export extensively in production, I can't
comment much on the "correctness" of the current locking, but it is
"good enough" for us (we don't explicitly mount with local locks atm).

The unique thing about our workloads, though, is that (other than
maybe the odd log file or home-directory shell history file) a
single process always writes a new, unique file and we never
overwrite existing ones. We have an asset management DB that
determines the file paths to be written and a batch system to run
the processes (i.e. a production pipeline + render farm).

We also really try to avoid having either the origin backend server
or the re-export server crash/reboot. But when something inevitably
does go wrong once a year or so, we are willing to take the hit of
broken mounts, hung processes or corrupted files (we just re-run
the batch jobs).

Basically the upsides outweigh the downsides for our specific workloads.

Coupled with FS-Cache and a few TBs of storage, using a re-export
server is a very efficient way to serve files to many clients over a
bandwidth constrained and/or high latency WAN link. In the case of
high latency (e.g. global offices), we even do things like increase
actimeo and disable CTO to reduce repeat metadata round-trips to the
absolute minimum. Again, I think we have a unique workload that allows
for this.
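
For a high latency office, that might look something like this
(illustrative numbers only, and "fsc" needs cachefilesd running on
the re-export server):

  mount -t nfs -o vers=3,fsc,nocto,actimeo=3600 origin:/export /srv/reexport

where "nocto" drops close-to-open cache consistency and a large
actimeo stretches the attribute cache, so repeat metadata lookups
of the same (for us, immutable) files rarely cross the WAN.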

If locks are eventually passed through to the backend server, then
I suspect we would still want a way to opt out, trading locking
correctness for reduced WAN latency overhead (maybe just by using
local locks).

I think others with similar workloads are using it in this way too and
I know Google were maintaining a howto to help customers migrate
workloads to their cloud:

https://github.com/GoogleCloudPlatform/knfsd-cache-utils
https://cloud.google.com/architecture/deploy-nfs-caching-proxy-compute-engine

Although it seems like that specific project has gone a bit quiet of
late. They also helped get the reexport/crossmount fsidd helper merged
into nfs-utils.

I have also heard others say reexports are useful for "converting"
NFSv4 storage to NFSv3 (or vice-versa) for older non-NFSv4 clients or
servers, but I'm not sure how big a thing that is in this day and age.

I guess NetApp's "FlexCache" product is doing a similar thing to
reexporting and seems to lean heavily on NFSv4 and delegations to
achieve that? The latest version can even do write-back caching on
files (get lock first, write back later).

I could probably write a whole (longish) thread about the different
ways we currently use NFS re-exporting and some of the remaining
pitfalls if there is any interest in that...

Daire



