On Wed, May 24, 2023 at 06:35:04PM +0200, Jan Kara wrote: > Hello! > > This is again about the problem with directory renames I've already > reported in [1]. To quickly sum it up some filesystems (so far we know at > least about xfs, ext4, udf, reiserfs) need to lock the directory when it is > being renamed into another directory. This is because we need to update the > parent pointer in the directory in that case and if that races with other > operation on the directory, bad things can happen. > > So far we've done the locking in the filesystem code but recently Darrick > pointed out [2] that we've missed the RENAME_EXCHANGE case in our ext4 fix. > That one is particularly nasty because RENAME_EXCHANGE can arbitrarily mix > regular files and directories. Couple nasty arising cases: > > 1) We need to additionally lock two exchanged directories. Suppose a > situation like: > > mkdir P; mkdir P/A; mkdir P/B; touch P/B/F > > CPU1 CPU2 > renameat2("P/A", "P/B", RENAME_EXCHANGE); renameat2("P/B/F", "P/A", 0); > > Both operations need to lock A and B directories which are unrelated in the > tree. This means we must establish stable lock ordering on directory locks > even for the case when they are not in ancestor relationship. > > 2) We may need to lock a directory and a non-directory and they can be in > parent-child relationship when hardlinks are involved: > > mkdir A; mkdir B; touch A/F; ln A/F B/F > renameat2("A/F", "B"); > > And this is really nasty because we don't have a way to find out whether > "A/F" and "B" are in any relationship - in particular whether B happens to > be another parent of A/F or not. > > What I've decided to do is to make sure we always lock directory first in > this mixed case and that *should* avoid all the deadlocks but I'm spelling > this out here just in case people can think of some even more wicked case > before I'll send patches. I can think of a few things: Not taking i_rwsem when updating the dotdot entry as part of moving child directories means that readdir can race with a dotdot update to a shortformat XFS directory. That currently doesn't get too ugly because we reserve a full 8 bytes for the shortform dotdot entry's inode (which means we are not at risk of changing the format). As Dave has previously noted, xfs synchronizes on the ILOCK for metadata updates. This also makes online directory fsck harder because any time we have to cycle the directory's ILOCK means we have to revalidate the directory contents afterwards or hook the directory update code so that we can learn about changes that happened while the ILOCK was dropped. (I'm sure the rest of you are thinking "that's some crazy xfs thing"). Things get weirder with the new xfs parent pointers feature -- rename doesn't take i_rwsem of the children, which means that the only synchronization point is with the ILOCK held. Parent pointers are stored in a special xattr namespace. This makes me nervous about collisions with a getxattr call, since those can be done with or without i_rwsem held at all. I think the ILOCK saves us in this case, since I've been testing it for a few months now without any corruption problems. Obviously, online fsck of parent pointers has to employ the same revalidate/hook to deal with ILOCK cycling. But again, that's "some crazy xfs thing". I wouldn't mind the VFS taking i_rwsem on the children to simplify online fsck, but I've got a reasonable enough workaround. --D > Also I wanted to ask (Miklos in particular as RENAME_EXCHANGE author): Why > do we lock non-directories in RENAME_EXCHANGE case? If we didn't have to do > that things would be somewhat simpler... > > Honza > > [1] https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3 > [2] https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs > > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR