Sorry for the noise, forgot to mention.

I personally don't care about nfsd_copy_write_verifier(), and this code doesn't
really look buggy. I am trying to audit the users of read_seqbegin_or_lock(),
see https://lore.kernel.org/all/20231024120808.GA15382@xxxxxxxxxx/

On 10/25, Oleg Nesterov wrote:
>
> Hi Chuck,
>
> Thanks for your reply. But I am already sleeping and I can't understand it.
> So let me ask a couple of questions.
>
> 1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?
>
>    I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
>    pass fails it should take writeverf_lock for writing. But this can't happen,
>    and thus this code doesn't look right no matter what. None of the
>    read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
>    because "seq" is always even.
>
> 2. If yes, which change do you prefer? I'd prefer the patch at the end.
>
> Oleg.
>
> On 10/25, Chuck Lever wrote:
> >
> > On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> > > Hello,
> > >
> > > The usage of writeverf_lock is wrong and misleading no matter what and
> > > I cannot understand the intent.
> >
> > The structure of the seqlock was introduced in commit 27c438f53e79
> > ("nfsd: Support the server resetting the boot verifier").
> >
> > The NFS write verifier is an 8-byte cookie that is supposed to
> > indicate the boot epoch of the server -- simply put, when the server
> > restarts, the epoch (and this verifier) changes.
> >
> > NFSv3 and later have a two-phase write scheme where the client
> > sends data to the server (known as an UNSTABLE WRITE), then later
> > asks the server to commit that data (a COMMIT). Before the COMMIT,
> > that data is not durable and the client must hold onto it until
> > the server's COMMIT Reply indicates it's safe for the client to
> > discard that data and move on.
> >
> > When an UNSTABLE WRITE is done, the server reports its current
> > epoch as part of each WRITE Reply. If this verifier cookie changes,
> > the client knows that the server might have lost previously
> > written-but-uncommitted data, so it must send the WRITEs
> > again in that (rare) case.
> >
> > NFSD abuses this slightly by changing the write verifier whenever
> > there is an underlying local write error that might have occurred in
> > the background (i.e., there was no WRITE or COMMIT operation at the
> > time that the server could use to convey the error back to the
> > client). This is supposed to trigger clients to send UNSTABLE WRITEs
> > again to ensure that data is properly committed to durable storage.
> >
> > The point of the seqlock is to ensure that
> >
> >  a) a write verifier update does not tear the verifier
> >  b) a write verifier read does not see a torn verifier
> >
> > This is a hot path, so we don't want a full spinlock to achieve
> > a) and b).
> >
> > Way back when, the verifier was updated by two separate 32-bit
> > stores; hence the risk of tearing.
> >
> >
> > > nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> > > "seq" is always even, so read_seqbegin_or_lock() can never take the
> > > lock for writing. We need to make the counter odd for the 2nd round:
> > >
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > >   */
> > >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > >  {
> > > -	int seq = 0;
> > > +	int seq, nextseq = 0;
> > >
> > >  	do {
> > > +		seq = nextseq;
> > >  		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > +		/* If lockless access failed, take the lock. */
> > > +		nextseq = 1;
> > >  	} while (need_seqretry(&nn->writeverf_lock, seq));
> > >  	done_seqretry(&nn->writeverf_lock, seq);
> > >  }
> > >
> > > OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> > > need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> > > the (untested) patch below makes more sense? Please note that it should not
> > > change the current behaviour, it just makes the code look correct (and more
> > > optimal but this is minor).
> > >
> > > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > > serialise with itself, right?
> >
> > "reset" is supposed to be a very rare operation. Using a lock in that
> > case is probably quite acceptable, as long as reading the verifier
> > is wait-free and guaranteed to be untorn.
> >
> > But a seqcount_t is only 32 bits.
> >
> >
> > > Oleg.
> > > ---
> > >
> > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > index c7af1095f6b5..094b765c5397 100644
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > >   */
> > >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > >  {
> > > -	int seq = 0;
> > > +	unsigned seq;
> > >
> > >  	do {
> > > -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > +		seq = read_seqbegin(&nn->writeverf_lock);
> > >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > -	} while (need_seqretry(&nn->writeverf_lock, seq));
> > > -	done_seqretry(&nn->writeverf_lock, seq);
> > > +	} while (read_seqretry(&nn->writeverf_lock, seq));
> > >  }
> > >
> > >  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> > >
> >
> > --
> > Chuck Lever
> >
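
The "seq is always even" argument in the quoted thread can be checked with a
small single-threaded userspace model. This is only a sketch, not kernel code:
every model_* name is hypothetical, and it assumes the helpers behave as the
thread describes (an even *seq selects the lockless pass, an odd *seq selects
the locked pass, and only a lockless pass can request a retry). The racing
writer is simulated by bumping the counter by 2 after a lockless pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model for illustration only; every model_* name is hypothetical. */
struct model_seqlock {
	unsigned sequence;	/* even while no write is in progress */
	int pending_writes;	/* simulated writes that race with lockless passes */
	bool held;		/* a reader holds the lock exclusively */
};

struct model_stats {
	int lockless_passes;
	int locked_passes;
};

/* Even *seq: sample the counter locklessly. Odd *seq: take the lock. */
static void model_read_seqbegin_or_lock(struct model_seqlock *sl, int *seq)
{
	if (!(*seq & 1)) {
		*seq = (int)sl->sequence;
		assert(!(*seq & 1));	/* a lockless pass always leaves *seq even */
	} else {
		sl->held = true;
	}
}

/* Only a lockless (even) pass can ask for a retry. */
static bool model_need_seqretry(const struct model_seqlock *sl, int seq)
{
	return !(seq & 1) && sl->sequence != (unsigned)seq;
}

/* Drop the lock if the last pass took it. */
static void model_done_seqretry(struct model_seqlock *sl, int seq)
{
	if (seq & 1)
		sl->held = false;
}

/* One read-side pass: count it, then let a simulated writer race with it. */
static void model_pass(struct model_seqlock *sl, int seq, struct model_stats *st)
{
	if (seq & 1)
		st->locked_passes++;
	else
		st->lockless_passes++;
	if (!sl->held && sl->pending_writes > 0) {
		sl->sequence += 2;	/* one complete write: odd, then even again */
		sl->pending_writes--;
	}
}

/* Pattern of the current code: seq starts at 0 and is never made odd. */
struct model_stats model_copy_current(int pending_writes)
{
	struct model_seqlock sl = { 0, pending_writes, false };
	struct model_stats st = { 0, 0 };
	int seq = 0;

	do {
		model_read_seqbegin_or_lock(&sl, &seq);
		model_pass(&sl, seq, &st);
	} while (model_need_seqretry(&sl, seq));
	model_done_seqretry(&sl, seq);
	return st;
}

/* Pattern of the first quoted patch: force an odd seq for the 2nd round. */
struct model_stats model_copy_nextseq(int pending_writes)
{
	struct model_seqlock sl = { 0, pending_writes, false };
	struct model_stats st = { 0, 0 };
	int seq, nextseq = 0;

	do {
		seq = nextseq;
		model_read_seqbegin_or_lock(&sl, &seq);
		model_pass(&sl, seq, &st);
		nextseq = 1;	/* if the lockless pass failed, take the lock */
	} while (model_need_seqretry(&sl, seq));
	model_done_seqretry(&sl, seq);
	return st;
}
```

With three simulated racing writes, model_copy_current(3) makes four lockless
passes and never takes the lock, while model_copy_nextseq(3) makes one lockless
pass followed by one locked pass. The model says nothing about which fix is
preferable for an 8-byte copy; it only illustrates why the locked path is
unreachable in the current pattern.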