On Wed, Mar 01, 2023 at 01:36:28PM +0100, Jan Kara wrote:
> On Tue 28-02-23 12:58:07, Dave Chinner wrote:
> > On Fri, Feb 24, 2023 at 07:46:57PM -0800, Darrick J. Wong wrote:
> > > So xfs_dir2_sf_replace can rewrite the shortform structure (or even
> > > convert it to block format!) while readdir is accessing it.  Or am I
> > > missing something?
> >
> > True, I missed that.
> >
> > Hmmmm. ISTR that holding ILOCK over filldir callbacks causes
> > problems with lock ordering[1], and that's why we removed the ILOCK
> > from the getdents path in the first place and instead relied on the
> > IOLOCK being held by the VFS across readdir for exclusion against
> > concurrent modification from the VFS.
> >
> > Yup, the current code only holds the ILOCK for the extent lookup and
> > buffer read process, it drops it while it is walking the locked
> > buffer and calling the filldir callback. Which is why we don't hold
> > it for xfs_dir2_sf_getdents() - the VFS is supposed to be holding
> > i_rwsem in exclusive mode for any operation that modifies a
> > directory entry. We should only need the ILOCK for serialising the
> > extent tree loading, not for serialising access vs modification to
> > the directory.
> >
> > So, yeah, I think you're right, Darrick. And the fix is that the VFS
> > needs to hold the i_rwsem correctly for all inodes that may be
> > modified during rename...
>
> But Al Viro didn't want to lock the inode in the VFS (as some filesystems
> don't need the lock)

Was any reason given?

We know we have to modify the ".." entry of the child directory
being moved, so I'd really like to understand why the locking rule
of "directory i_rwsem must be held exclusively over modifications"
so that we can use shared access for read operations has been waived
for this specific case.

Apart from exposing multiple filesystems to modifications racing
with operations that hold the i_rwsem shared to *prevent concurrent
directory modifications*, what performance or scalability benefit is
seen as a result of eliding this inode lock from the VFS rename
setup?

This looks like a straightforward VFS level directory locking
violation, and now we are playing whack-a-mole to fix it in each
filesystem we discover that requires the child directory inode to be
locked...

> so in ext4 we ended up grabbing the lock in
> ext4_rename() like:
>
> +       /*
> +        * We need to protect against old.inode directory getting
> +        * converted from inline directory format into a normal one.
> +        */
> +       inode_lock_nested(old.inode, I_MUTEX_NONDIR2);

Why are you using the I_MUTEX_NONDIR2 annotation when locking a
directory inode? That doesn't seem right. Further, how do we
guarantee correct i_rwsem lock ordering against all the other
inodes that the VFS has already locked and/or other multi-inode
i_rwsem locking primitives in the VFS?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx