Re: [PATCH 3/3] nfsd: simplify write verifier handling

On Tue, 2023-02-14 at 14:57 -0800, Rick Macklem wrote:
> On Tue, Feb 14, 2023 at 5:53 AM Jeff Layton <jlayton@xxxxxxxxxx>
> wrote:
> > 
> > On Mon, 2023-02-13 at 22:28 -0500, Trond Myklebust wrote:
> > > On Mon, 2023-02-13 at 16:49 -0800, Rick Macklem wrote:
> > > > On Mon, Feb 13, 2023 at 1:14 PM Jeff Layton <jlayton@xxxxxxxxxx>
> > > > wrote:
> > > > > 
> > > > > The write verifier exists to tell the client when the server
> > > > > may have forgotten some unstable writes. The typical way that
> > > > > this happens is if the server crashes, but we've also extended
> > > > > nfsd to change it when there are writeback errors as well.
> > > > > 
> > > > > The way it works today though, we call something like vfs_fsync
> > > > > (e.g. for a COMMIT call) and if we get back an error, we'll
> > > > > reset the write verifier.
> > > > > 
> > > > > This is non-optimal for a couple of reasons:
> > > > > 
> > > > > 1/ There could be significant delay between an error being
> > > > > recorded and the reset. It would be ideal if the write verifier
> > > > > were to change as soon as the error was recorded.
> > > > > 
> > > > > 2/ It's a bit of a waste, in that if we get a writeback error
> > > > > on a single inode, we'll end up resetting the write verifier
> > > > > for everything, even on inodes that may be fine (e.g. on a
> > > > > completely separate fs).
> > > > > 
> > > > Here's the snippet from RFC8881:
> > > >    The final portion of the result is the field writeverf.  This
> > > >    field is the write verifier and is a cookie that the client
> > > >    can use to determine whether a server has changed instance
> > > >    state (e.g., server restart) between a call to WRITE and a
> > > >    subsequent call to either WRITE or COMMIT.  This cookie MUST
> > > >    be unchanged during a single instance of the NFSv4.1 server
> > > >    and MUST be unique between instances of the NFSv4.1 server.
> > > >    If the cookie changes, then the client MUST assume that any
> > > >    data written with an UNSTABLE4 value for committed and an old
> > > >    writeverf in the reply has been lost and will need to be
> > > >    recovered.
> > > > 
> > > > I've always interpreted the writeverf as "per-server" and not
> > > > "per-file". Although I'll admit the above does not make that
> > > > crystal clear, it does make it clear that the writeverf applies
> > > > to a "server instance" and not a file or file system on the
> > > > server.
> > > > 
> > > > The FreeBSD client assumes it is "per-server" and re-writes all
> > > > uncommitted writes for the server, not just ones for the file
> > > > (or file system) the writeverf is replied with.  (I vaguely
> > > > recall Solaris does the same?)
> > > > 
> > > > At the very least, I think you should run this past the IETF
> > > > working group (nfsv4@xxxxxxxx) to see what they say w.r.t. the
> > > > writeverf being "per-file" vs "per-server".
> > > > 
> > > 
> > > As I recall, we've already had this discussion on the IETF NFSv4
> > > working group mailing list:
> > > https://mailarchive.ietf.org/arch/msg/nfsv4/99Ow2muMylXKWd9lzi9_BX2LJDY/
> > > 
> > > 
> > > That's why I kept it a global in the first place.
> > > 
> > > Now note that RFC8881 does also clarify in Section 18.3.3 that:
> > > 
> > >    The server must vary the value of the write verifier at each
> > >    server event or instantiation that may lead to a loss of
> > >    uncommitted data.  Most commonly this occurs when the server is
> > >    restarted; however, other events at the server may result in
> > >    uncommitted data loss as well.
> > > 
> > > So I feel it is quite OK to use the verifier the way we do now in
> > > order to signify that a fatal write error has occurred and that
> > > clients must resend any data that was uncommitted.
> > > 
> > 
> > Thanks, I missed that discussion. I think you guys have convinced me
> > that we have to keep this per-server. I won't bother starting a new
> > thread on it.
> > 
> > It's a pity. It would have been a lot more elegant as a per-inode
> > thing!
> > 
> If you think it is worth the effort, you could propose an extension
> to 4.2. Something like Write_plus, Commit_plus operations.
> 
> rick
> 
OK. Apparently some further clarification is required here.

The main reason for needing to bump the boot verifier on an error is
NFSv3. Unlike NFSv4, NFSv3 clients are stateless, and so error
propagation using Jeff's "errseq" mechanism is inherently flawed,
because it ultimately caches the I/O errors in a file descriptor. The
fact that the file cache garbage collector can close these NFSv3 file
descriptors at any time, without any possibility of coordination with
the clients, and thereby lose the "errseq" cached data, means that we
must have an alternative mechanism to propagate error state for
unstable writes. Hence the use of the boot verifier.
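
To make the file-descriptor dependency concrete, here is a rough sketch
of how that errseq tracking hangs together (illustrative only, not code
from Jeff's series; the nfsd3_*() helper names are invented):

#include <linux/errseq.h>
#include <linux/fs.h>

/*
 * The error itself is recorded against the inode's address_space, but
 * the "since this opener last looked" cursor lives in the open
 * struct file (file->f_wb_err).
 */
static void nfsd3_writeback_failed(struct address_space *mapping, int err)
{
	errseq_set(&mapping->wb_err, err);
}

/* COMMIT path: report the error once to this opener, then advance. */
static int nfsd3_commit_check(struct file *file)
{
	/* Compares mapping->wb_err against file->f_wb_err. */
	return file_check_and_advance_wb_err(file);
}

The point being that the cursor is part of struct file, so it does not
survive the filecache garbage collector closing that file.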

NFSv4 does not have these limitations. The clients are required to use
stateful OPEN/CLOSE/DELEGRETURN/... operations in order to signify to
the NFSv4 server when file descriptors need to be kept open, and hence
when "errseq" data needs to be preserved. The only cases where that
assumption is violated are those of a network partition and a server
reboot, where it is obvious to all parties that information has been
lost.

Note: The Linux NFSv4 client does not currently assume that information
might be lost during a network partition, so we might want to look into
fixing that. However, it does obviously recover safely after a server
reboot thanks to the boot verifier mechanism.

IOW: there should be no need for a change in NFSv4 semantics in order
to address this issue. The problem is, as usual, mostly limited to
NFSv3.
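
For reference, the per-server behaviour we are keeping amounts to
something like the following (a sketch only, not the actual fs/nfsd
code; nfsd_commit_sketch() is an invented name, and the reset helper is
assumed to behave like nfsd's current one):

/* On a COMMIT-time fsync failure, regenerate the server-wide write
 * verifier so that every client resends its uncommitted WRITEs. */
static __be32 nfsd_commit_sketch(struct nfsd_net *nn, struct file *file,
				 loff_t start, loff_t end)
{
	int err = vfs_fsync_range(file, start, end, 0);

	if (err) {
		nfsd_reset_write_verifier(nn);	/* per-server, not per-inode */
		return nfserrno(err);
	}
	return nfs_ok;
}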

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx




