On Wed, 29 May 2024, Trond Myklebust wrote:
> On Tue, 2024-05-28 at 11:19 +1000, NeilBrown wrote:
> > On Tue, 28 May 2024, Trond Myklebust wrote:
> > > On Mon, 2024-05-27 at 13:04 +1000, NeilBrown wrote:
> > > >
> > > > dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or
> > > > renaming-over a file to ensure that no open succeeds while the NFS
> > > > operation progresses on the server.
> > > >
> > > > Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock
> > > > after checking the refcount is not elevated.  Any attempt to open the
> > > > file (through that name) will go through lookup_open() which will take
> > > > ->d_lock while incrementing the refcount, so we can be sure that once
> > > > the new value is set, __nfs_lookup_revalidate() *will* see the new
> > > > value and will block.
> > > >
> > > > We don't have any locking guarantee that when we set ->d_fsdata to
> > > > NULL, the wait_var_event() in __nfs_lookup_revalidate() will notice.
> > > > wait/wake primitives do NOT provide barriers to guarantee order.  We
> > > > must use smp_load_acquire() in wait_var_event() to ensure we look at
> > > > an up-to-date value, and must use smp_store_release() before
> > > > wake_up_var().
> > > >
> > > > This patch adds those barrier functions and factors out
> > > > block_revalidate() and unblock_revalidate() for clarity.
> > > >
> > > > There is also a hypothetical bug in that if memory allocation fails
> > > > (which never happens in practice) we might leave ->d_fsdata locked.
> > > > This patch adds the missing call to unblock_revalidate().
> > > >
> > > > Reported-and-tested-by: Richard Kojedzinszky <richard+debian+bugreport@xxxxxxxxx>
> > > > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501
> > > > Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename")
> > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > ---
> > > >  fs/nfs/dir.c | 44 +++++++++++++++++++++++++++++---------------
> > > >  1 file changed, 29 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > index ac505671efbd..c91dc36d41cc 100644
> > > > --- a/fs/nfs/dir.c
> > > > +++ b/fs/nfs/dir.c
> > > > @@ -1802,9 +1802,10 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
> > > >  		if (parent != READ_ONCE(dentry->d_parent))
> > > >  			return -ECHILD;
> > > >  	} else {
> > > > -		/* Wait for unlink to complete */
> > > > +		/* Wait for unlink to complete - see unblock_revalidate() */
> > > >  		wait_var_event(&dentry->d_fsdata,
> > > > -			       dentry->d_fsdata != NFS_FSDATA_BLOCKED);
> > > > +			       smp_load_acquire(&dentry->d_fsdata)
> > > > +			       != NFS_FSDATA_BLOCKED);
> > >
> > > Doesn't this end up being a reversed ACQUIRE+RELEASE as described in
> > > the "LOCK ACQUISITION FUNCTIONS" section of
> > > Documentation/memory-barriers.txt?
> >
> > I don't think so.  That section is talking about STORE operations which
> > can move backwards through ACQUIRE and forwards through RELEASE.
> >
> > Above we have a LOAD operation which mustn't move backwards through
> > ACQUIRE.  Below there is a STORE operation which mustn't move forwards
> > through a RELEASE.  Both of those are guaranteed.
>
> It isn't necessary for the LOAD to move backwards through the ACQUIRE.
> As I understand it, the point is that the ACQUIRE can move through the
> RELEASE as per the following paragraph in that document:
>
>     Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
>     not imply a full memory barrier.  Therefore, the CPU's execution of the
>     critical sections corresponding to the RELEASE and the ACQUIRE can cross,
>     so that:
>
>     	*A = a;
>     	RELEASE M
>     	ACQUIRE N
>     	*B = b;
>
>     could occur as:
>
>     	ACQUIRE N, STORE *B, STORE *A, RELEASE M

On the wakeup side we have:

   STORE ->d_fsdata
   RELEASE
   spin_lock
   LOAD: check if waitq is empty

and on the wait side we have:

   STORE: add to waitq
   spin_unlock
   ACQUIRE
   LOAD ->d_fsdata

I believe that spin_lock is an ACQUIRE operation and spin_unlock is a
RELEASE operation.  So both of these have "ACQUIRE ; RELEASE" which
creates a full barrier.

>
> This would presumably be why the function clear_and_wake_up_bit() needs
> a full memory barrier on most architectures, instead of being just an
> smp_wmb().  Is my understanding of this wrong?

clear_and_wake_up_bit() calls __wake_up_bit() which calls
waitqueue_active() without taking a lock.  So there is no ACQUIRE after
the RELEASE in clear_bit_unlock() and before testing for list-empty.
So we need to explicitly add a barrier.

At least that is my understanding just now.  I hadn't worked through
all the details before, but I'm glad you prompted me to - thanks.

NeilBrown
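
[A minimal sketch of the release/acquire pairing being discussed.  The
helper name unblock_revalidate() is taken from the commit message
above; the body shown here is illustrative, not a verbatim copy of the
patch.]

	/* Sketch only: the pairing in fs/nfs/dir.c after the patch. */

	static void unblock_revalidate(struct dentry *dentry)
	{
		/*
		 * The store-release pairs with the load-acquire in the
		 * wait_var_event() condition below: a waiter that
		 * observes the new ->d_fsdata value also observes all
		 * stores made before the unlink/rename completed.
		 */
		smp_store_release(&dentry->d_fsdata, NULL);
		wake_up_var(&dentry->d_fsdata);
	}

	/* Wait side, as in the quoted hunk of __nfs_lookup_revalidate(): */
	wait_var_event(&dentry->d_fsdata,
		       smp_load_acquire(&dentry->d_fsdata) != NFS_FSDATA_BLOCKED);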