Re: nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock()

Oleg Nesterov <oleg@xxxxxxxxxx> · Wed, 25 Oct 2023 19:39:31 +0200

Hi Chuck,

Thanks for your reply. But I am already half asleep and I can't quite follow it.
So let me ask a couple of questions.

1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?

   I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
   pass fails it should take writeverf_lock for writing. But this can't happen,
   and thus this code doesn't look right no matter what. None of the
   read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
   because "seq" is always even.

2. If yes, which change do you prefer? I'd prefer the patch at the end.
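
To make point 1 concrete, here is a small userspace sketch. It is not the
kernel code: every toy_* name is hypothetical, the "lock" is a pair of
counters, and a concurrent writer is simulated by pending_writes. It only
mirrors the even/odd convention of read_seqbegin_or_lock(), need_seqretry()
and done_seqretry() as I understand it, under those assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded stand-in for seqlock_t; NOT the kernel type. */
struct toy_seqlock {
	unsigned sequence;	/* even when no writer is active */
	int pending_writes;	/* simulated writers racing with a lockless pass */
	int excl_taken;		/* times the locked (odd seq) path was entered */
	int excl_held;		/* 1 while the simulated lock is held */
};

static void toy_read_seqbegin_or_lock(struct toy_seqlock *sl, int *seq)
{
	if (!(*seq & 1)) {		/* even: lockless pass */
		assert(!(sl->sequence & 1));
		*seq = (int)sl->sequence;	/* the result is even again */
	} else {			/* odd: take the lock */
		sl->excl_held = 1;
		sl->excl_taken++;
	}
}

static bool toy_need_seqretry(struct toy_seqlock *sl, int seq)
{
	if (seq & 1)
		return false;		/* a locked pass cannot fail */
	if (sl->pending_writes > 0) {	/* a simulated writer slips in */
		sl->pending_writes--;
		sl->sequence += 2;
	}
	return sl->sequence != (unsigned)seq;
}

static void toy_done_seqretry(struct toy_seqlock *sl, int seq)
{
	if (seq & 1)
		sl->excl_held = 0;
}

/* Loop shape of the current nfsd_copy_write_verifier(); returns passes. */
static int toy_copy_current(struct toy_seqlock *sl)
{
	int seq = 0, passes = 0;

	do {
		toy_read_seqbegin_or_lock(sl, &seq);
		passes++;		/* the memcpy() would go here */
	} while (toy_need_seqretry(sl, seq));
	toy_done_seqretry(sl, seq);
	return passes;
}

/* Same loop, but the counter is forced odd for the 2nd round. */
static int toy_copy_nextseq(struct toy_seqlock *sl)
{
	int seq, nextseq = 0, passes = 0;

	do {
		seq = nextseq;
		toy_read_seqbegin_or_lock(sl, &seq);
		passes++;
		nextseq = 1;		/* if lockless access failed, take the lock */
	} while (toy_need_seqretry(sl, seq));
	toy_done_seqretry(sl, seq);
	return passes;
}
```

With three simulated writers, toy_copy_current() needs four lockless passes
and never enters the locked path, while toy_copy_nextseq() takes the lock on
its second pass. In this model the first loop behaves like a plain
read_seqbegin()/read_seqretry() loop, just written in a misleading way.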

Oleg.

On 10/25, Chuck Lever wrote:
>
> On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> > Hello,
> >
> > The usage of writeverf_lock is wrong and misleading no matter what and
> > I can not understand the intent.
>
> The structure of the seqlock was introduced in commit 27c438f53e79
> ("nfsd: Support the server resetting the boot verifier").
>
> The NFS write verifier is an 8-byte cookie that is supposed to
> indicate the boot epoch of the server -- simply put, when the server
> restarts, the epoch (and this verifier) changes.
>
> NFSv3 and later have a two-phase write scheme where the client
> sends data to the server (known as an UNSTABLE WRITE), then later
> asks the server to commit that data (a COMMIT). Before the COMMIT,
> that data is not durable and the client must hold onto it until
> the server's COMMIT Reply indicates it's safe for the client to
> discard that data and move on.
>
> When an UNSTABLE WRITE is done, the server reports its current
> epoch as part of each WRITE Reply. If this verifier cookie changes,
> the client knows that the server might have lost previously
> written-but-uncommitted data, so it must send the WRITEs
> again in that (rare) case.
>
> NFSD abuses this slightly by changing the write verifier whenever
> there is an underlying local write error that might have occurred in
> the background (i.e., there was no WRITE or COMMIT operation at the
> time that the server could use to convey the error back to the
> client). This is supposed to trigger clients to send UNSTABLE WRITEs
> again to ensure that data is properly committed to durable storage.
>
> The point of the seqlock is to ensure that
>
> a) a write verifier update does not tear the verifier
> b) a write verifier read does not see a torn verifier
>
> This is a hot path, so we don't want a full spinlock to achieve
> a) and b).
>
> Way back when, the verifier was updated by two separate 32-bit
> stores; hence the risk of tearing.
>
>
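
For reference, a) and b) can be illustrated with a minimal userspace sketch
of a sequence counter guarding an 8-byte value. The toy_* names are
hypothetical, and the model is single-threaded with no memory barriers, so
it only shows the counter protocol. Unlike the kernel's read_seqbegin(), it
does not spin while a writer is active; it just reports that a retry is
needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model: a sequence counter guarding an 8-byte verifier. Not kernel code. */
struct toy_verf {
	unsigned seq;		/* odd while an update is in progress */
	uint32_t verf[2];	/* updated by two separate 32-bit stores */
};

static void toy_write_begin(struct toy_verf *v)
{
	v->seq++;		/* even -> odd */
}

static void toy_write_end(struct toy_verf *v)
{
	assert(v->seq & 1);	/* must pair with toy_write_begin() */
	v->seq++;		/* odd -> even */
}

/* Snapshot the counter before copying. */
static unsigned toy_read_begin(const struct toy_verf *v)
{
	return v->seq;
}

/* True if the copy may be torn: a writer was active or the counter moved. */
static bool toy_read_retry(const struct toy_verf *v, unsigned start)
{
	return (start & 1) || v->seq != start;
}

/* One read attempt; returns true if out[] is a consistent snapshot. */
static bool toy_try_copy(const struct toy_verf *v, uint32_t out[2])
{
	unsigned start = toy_read_begin(v);

	memcpy(out, v->verf, sizeof(v->verf));
	return !toy_read_retry(v, start);
}
```

A read attempted between the two 32-bit stores sees an odd counter and is
rejected; a read that starts before an update and finishes after it sees a
changed counter and is rejected as well. A real reader would simply loop
until an attempt succeeds.
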
> > nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> > "seq" is always even, so read_seqbegin_or_lock() can never take the
> > lock for writing. We need to make the counter odd for the 2nd round:
> >
> > 	--- a/fs/nfsd/nfssvc.c
> > 	+++ b/fs/nfsd/nfssvc.c
> > 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > 	  */
> > 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > 	 {
> > 	-	int seq = 0;
> > 	+	int seq, nextseq = 0;
> >
> > 		do {
> > 	+		seq = nextseq;
> > 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > 	+		/* If lockless access failed, take the lock. */
> > 	+		nextseq = 1;
> > 		} while (need_seqretry(&nn->writeverf_lock, seq));
> > 		done_seqretry(&nn->writeverf_lock, seq);
> > 	 }
> >
> > OTOH, this function just copies 8 bytes, which makes me think that it doesn't
> > need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> > the (untested) patch below makes more sense? Please note that it should not
> > change the current behaviour, it just makes the code look correct (and more
> > optimal but this is minor).
> >
> > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > serialise with itself, right?
>
> "reset" is supposed to be a very rare operation. Using a lock in that
> case is probably quite acceptable, as long as reading the verifier
> is wait-free and guaranteed to be untorn.
>
> But a seqcount_t is only 32 bits.
>
>
> > Oleg.
> > ---
> >
> > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > index c7af1095f6b5..094b765c5397 100644
> > --- a/fs/nfsd/nfssvc.c
> > +++ b/fs/nfsd/nfssvc.c
> > @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> >   */
> >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> >  {
> > -	int seq = 0;
> > +	unsigned seq;
> >
> >  	do {
> > -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > +		seq = read_seqbegin(&nn->writeverf_lock);
> >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > -	} while (need_seqretry(&nn->writeverf_lock, seq));
> > -	done_seqretry(&nn->writeverf_lock, seq);
> > +	} while (read_seqretry(&nn->writeverf_lock, seq));
> >  }
> >
> >  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> >
>
> --
> Chuck Lever
>