> > Instead of proof by contradiction, this proof more intuitively explains > > the order that the locking rules want to enforce. If the developers want > > to update the locking rules, they would be forced to come up with the > > corresponding order that justifies deadlock-freedom, otherwise, the > > proof won't pass. As a evidence of effectiveness, the authors of RefFS > > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved > > directories") through this formal approach. > > Any comments on this proof? BTW, it would be okay to keep both the > by-contradiction proof and by-order proof as both proofs may find their > audience. Continuously seeking for comments. Mo Zou <lostzoumo@xxxxxxxxx> 于2024年4月30日周二 09:44写道: > > > Instead of proof by contradiction, this proof more intuitively explains > > the order that the locking rules want to enforce. If the developers want > > to update the locking rules, they would be forced to come up with the > > corresponding order that justifies deadlock-freedom, otherwise, the > > proof won't pass. As a evidence of effectiveness, the authors of RefFS > > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved > > directories") through this formal approach. > > Any comments on this proof? BTW, it would be okay to keep both the > by-contradiction proof and by-order proof as both proofs may find their > audience. > > Mo Zou <lostzoumo@xxxxxxxxx> 于2024年4月13日周六 00:10写道: > > > > The directory locking proof should and could be made more formal. > > Specifically, the locking rules for rename are very intricate and > > subtle. The current proof shows deadlock-freedom by contradiction, i.e., > > if a deadlock is possible, where could go wrong. There is no denying > > that the proof is detailed with all possible cases considered. However, > > it gives less intuition on why the locking rules are correct and may be > > hard to maintain. Developers and even experts may still make mistakes > > without realizing that the proof is no longer correct. See commit > > 28eceeda130f ("fs: Lock moved directories") for such a case. > > > > A recent academia paper RefFS from OSDI 2024 has proposed a formal way > > to prove the correctness of locking scheme with machine-checkable proof > > (see https://ipads.se.sjtu.edu.cn/projects/reffs for a preprint of the > > paper). The core idea is to formally define the locking order of the > > rules. Actually, the existing proof has the same idea but fails to > > define the internal ranking for directories. The difficulty is that the > > ranking for directories is not static but dynamic. In this patch, we > > follow the idea from the paper to dynamically define the internal > > ranking of directories and shows that operations always take directories > > in order of increasing rank so that deadlocks are never possible. > > Instead of proof by contradiction, this proof more intuitively explains > > the order that the locking rules want to enforce. If the developers want > > to update the locking rules, they would be forced to come up with the > > corresponding order that justifies deadlock-freedom, otherwise, the > > proof won't pass. As a evidence of effectiveness, the authors of RefFS > > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved > > directories") through this formal approach. > > > > Signed-off-by: Mo Zou <lostzoumo@xxxxxxxxx> > > --- > > .../filesystems/directory-locking.rst | 224 +++++++++--------- > > 1 file changed, 113 insertions(+), 111 deletions(-) > > > > diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst > > index 05ea387bc9fb..94e358d71986 100644 > > --- a/Documentation/filesystems/directory-locking.rst > > +++ b/Documentation/filesystems/directory-locking.rst > > @@ -44,7 +44,7 @@ For our purposes all operations fall in 6 classes: > > * decide which of the source and target need to be locked. > > The source needs to be locked if it's a non-directory, target - if it's > > a non-directory or about to be removed. > > - * take the locks that need to be taken (exlusive), in inode pointer order > > + * take the locks that need to be taken (exclusive), in inode pointer order > > if need to take both (that can happen only when both source and target > > are non-directories - the source because it wouldn't need to be locked > > otherwise and the target because mixing directory and non-directory is > > @@ -135,12 +135,13 @@ If no directory is its own ancestor, the scheme above is deadlock-free. > > Proof: > > > > There is a ranking on the locks, such that all primitives take > > -them in order of non-decreasing rank. Namely, > > +them in order of increasing rank. Namely, > > > > * rank ->i_rwsem of non-directories on given filesystem in inode pointer > > order. > > - * put ->i_rwsem of all directories on a filesystem at the same rank, > > - lower than ->i_rwsem of any non-directory on the same filesystem. > > + * put ->i_rwsem of all directories on a filesystem lower than ->i_rwsem > > + of any non-directory on the same filesystem and the internal ranking > > + of directories are defined later. > > * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem > > on the same filesystem. > > * among the locks on different filesystems use the relative > > @@ -149,92 +150,120 @@ them in order of non-decreasing rank. Namely, > > For example, if we have NFS filesystem caching on a local one, we have > > > > 1. ->s_vfs_rename_mutex of NFS filesystem > > - 2. ->i_rwsem of directories on that NFS filesystem, same rank for all > > + 2. ->i_rwsem of directories on that NFS filesystem, internal ranking > > + defined later > > 3. ->i_rwsem of non-directories on that filesystem, in order of > > increasing address of inode > > 4. ->s_vfs_rename_mutex of local filesystem > > - 5. ->i_rwsem of directories on the local filesystem, same rank for all > > + 5. ->i_rwsem of directories on the local filesystem, internal ranking > > + defined later > > 6. ->i_rwsem of non-directories on local filesystem, in order of > > increasing address of inode. > > > > -It's easy to verify that operations never take a lock with rank > > -lower than that of an already held lock. > > - > > -Suppose deadlocks are possible. Consider the minimal deadlocked > > -set of threads. It is a cycle of several threads, each blocked on a lock > > -held by the next thread in the cycle. > > - > > -Since the locking order is consistent with the ranking, all > > -contended locks in the minimal deadlock will be of the same rank, > > -i.e. they all will be ->i_rwsem of directories on the same filesystem. > > -Moreover, without loss of generality we can assume that all operations > > -are done directly to that filesystem and none of them has actually > > -reached the method call. > > - > > -In other words, we have a cycle of threads, T1,..., Tn, > > -and the same number of directories (D1,...,Dn) such that > > - > > - T1 is blocked on D1 which is held by T2 > > - > > - T2 is blocked on D2 which is held by T3 > > - > > - ... > > - > > - Tn is blocked on Dn which is held by T1. > > - > > -Each operation in the minimal cycle must have locked at least > > -one directory and blocked on attempt to lock another. That leaves > > -only 3 possible operations: directory removal (locks parent, then > > -child), same-directory rename killing a subdirectory (ditto) and > > -cross-directory rename of some sort. > > - > > -There must be a cross-directory rename in the set; indeed, > > -if all operations had been of the "lock parent, then child" sort > > -we would have Dn a parent of D1, which is a parent of D2, which is > > -a parent of D3, ..., which is a parent of Dn. Relationships couldn't > > -have changed since the moment directory locks had been acquired, > > -so they would all hold simultaneously at the deadlock time and > > -we would have a loop. > > - > > -Since all operations are on the same filesystem, there can't be > > -more than one cross-directory rename among them. Without loss of > > -generality we can assume that T1 is the one doing a cross-directory > > -rename and everything else is of the "lock parent, then child" sort. > > - > > -In other words, we have a cross-directory rename that locked > > -Dn and blocked on attempt to lock D1, which is a parent of D2, which is > > -a parent of D3, ..., which is a parent of Dn. Relationships between > > -D1,...,Dn all hold simultaneously at the deadlock time. Moreover, > > -cross-directory rename does not get to locking any directories until it > > -has acquired filesystem lock and verified that directories involved have > > -a common ancestor, which guarantees that ancestry relationships between > > -all of them had been stable. > > - > > -Consider the order in which directories are locked by the > > -cross-directory rename; parents first, then possibly their children. > > -Dn and D1 would have to be among those, with Dn locked before D1. > > -Which pair could it be? > > - > > -It can't be the parents - indeed, since D1 is an ancestor of Dn, > > -it would be the first parent to be locked. Therefore at least one of the > > -children must be involved and thus neither of them could be a descendent > > -of another - otherwise the operation would not have progressed past > > -locking the parents. > > - > > -It can't be a parent and its child; otherwise we would've had > > -a loop, since the parents are locked before the children, so the parent > > -would have to be a descendent of its child. > > - > > -It can't be a parent and a child of another parent either. > > -Otherwise the child of the parent in question would've been a descendent > > -of another child. > > - > > -That leaves only one possibility - namely, both Dn and D1 are > > -among the children, in some order. But that is also impossible, since > > -neither of the children is a descendent of another. > > - > > -That concludes the proof, since the set of operations with the > > -properties requiered for a minimal deadlock can not exist. > > +.. _RefFS: https://ipads.se.sjtu.edu.cn/projects/reffs > > + > > +The ranking above is mostly static except the internal ranking of > > +directories. Below we will focus on internal ranking of directories on > > +the same filesystem and present a way to dynamically define ranks of > > +directories such that operations still take locks in order of > > +increasing rank, i.e., operations always take a lock with rank higher > > +than that of any already held locks. Note that the idea of dynamic > > +ranking has a formal basis and more details can be found in the > > +academia paper RefFS_ from OSDI 2024. > > + > > + > > +First, without considering cross-directory renames, a directory's rank > > +is defined as its distance from the root, i.e., root directory has rank > > +0 and each child directory's rank equals one plus its parent > > +directory's rank. Because no directory is its own ancestor, every > > +directory's rank can be uniquely determined starting from root to its > > +descendants. This ranking accounts for the (locks parent, then child) > > +nested locking in directory removal and same-directory rename. > > + > > +Then, consider the most complex case in a cross-directory rename, where the > > +most locks need to be acquired. The rename may perform the following > > +nested locking (only show lock statements for convenience). Here, dir1 > > +and dir2 are the two parents and according to the locking rules, we > > +know dir2 is not ancestor to dir1. child1 and child2 are source and > > +target in a order according to the rules above and may not necessarily > > +correspond to the child of dir1 and dir2. Ignore things in angle > > +brackets for now. The locking order here can not be predetermined. For > > +instance, rename(/a/x,/b/y) and rename(/b/y,/a/x) would acquire /a and > > +/b in different order because neither parent is an ancestor of the > > +other and we lock the parent of source first. But the observation is > > +that after acquiring the filesystem lock, the precise locking order > > +becomes known. We introduce a ghost state edge to help define the > > +ranks (ghost state is a commonly used technique in verification which > > +has no influence on the program execution and its only role is to > > +assist the proof). > > + > > + 1. lock filesystem (->s_vfs_rename_mutex); > > + 2. lock dir1->i_rwsem; <set edge to (dir1, dir2)> > > + 3. lock dir2->i_rwsem; <set edge to (non-parent, child1) if child1 is > > + a directory and non-parent is the one in dir1 and dir2 that is not > > + parent to child1, otherwise, set edge to None> > > + 4. lock child1->i_rwsem; <set edge to (child1, child2) if chil1 and > > + child2 are directories, otherwise, set edge to None> > > + 5. lock child2->i_rwsem; <set edge to None> > > + > > +The value of edge is either None or (inode1,inode2) > > +representing a possible lock dependency from inode1 to inode2, i.e., > > +some thread may request lock of inode2 with lock of inode1 held. There > > +is only ever one edge value in a filesystem with edge set to None when > > +the s_vfs_rename_mutex is not held and edge set to a determined value > > +when s_vfs_rename_mutex is held. When edge is None, the ranking rules > > +are the same as before (root is 0 and child is one plus its parent). > > +When edge is (inode1, inode2), only the ranking rule for inode2 > > +changes: inode2's rank is one plus the larger rank of inode2's parent > > +and inode1. Intuitively, a directory's rank is the longest distance > > +from root in the filesystem tree extended with edge. > > + > > +Now let's check the setting of edge in angle brackets in above code to > > +see how it ensures cross-directory rename takes locks in increasing > > +order in this case. After the lock statement at line 2, edge is set to > > +(dir1,dir2). Because dir2 is not ancestor to dir1, the rank of dir1 > > +stays the same and the rank of dir2 (as well as every other > > +directories) can be uniquely determined. So at line 3, dir2's rank is > > +higher than dir1's. Before the code acquires child1 at line 4, it > > +already holds dir1 and dir2. We know one of dir1 and dir2 is parent to > > +child1, so we update edge to (the non-parent of child1, child1). > > +Because child1 cannot be ancestor to dir1 or dir2, every directory's > > +rank is still uniquely defined (for the same reason as above). So at > > +line 4, we ensure child1 has higher rank than dir1 and dir2. Similarly, > > +we update edge to (child1, child2) so that at line 5, child2 has higher > > +rank than held locks (dir1, dir2 and child1). We reset edge to None > > +whenever we have acquired all directory locks. > > + > > +In a less complex case of cross-directory rename, the code may execute > > +a prefix of above lines and we always reset edge to None after having > > +acquired all directory locks. > > + > > +The ranking of directories may change under following cases: > > + > > + * directory removal/creation: a directory's rank is removed/added > > + * rename: the effect of rename is either (1) moving a subtree under a > > + node that is outside of the subtree (the directories in the subtree are > > + all recalculated based on the node whose rank is not influenced) > > + and potentially removing target (a directory's rank is removed) or > > + (2) in case of RENAME_EXCHANGE, two subtrees exchange their > > + positions. Because source is not a descendent of the target and > > + target is not a descendent of source, the parents of the two > > + subtrees cannot not be in these subtrees so the ranks of parents > > + are not influenced. The two subtrees' ranks are recalculated based > > + on their parents > > + * edge updates in a cross-directory rename: setting edge to > > + (inode1,inode2) or resetting edge to None modify the ranks of the > > + inode2 subtree > > + > > +Initially in a filesystem, no directory is its own ancestor and > > +directories' ranks can be uniquely determined. All these changes > > +preserve the fact that at every time, the ranking of every directory > > +can be uniquely determined. Thus no directory is its own ancestor > > +(because if a directory is its own ancestor, its rank cannot be > > +determined) and we prove the premise. Plus the fact that operations > > +always take locks in order of increasing rank (can be easily verified). > > +We can conclude the locking rules above are deadlock-free. > > + > > > > Note that the check for having a common ancestor in cross-directory > > rename is crucial - without it a deadlock would be possible. Indeed, > > @@ -247,33 +276,6 @@ distant ancestor. Add a bunch of rmdir() attempts on all directories > > in between (all of those would fail with -ENOTEMPTY, had they ever gotten > > the locks) and voila - we have a deadlock. > > > > -Loop avoidance > > -============== > > - > > -These operations are guaranteed to avoid loop creation. Indeed, > > -the only operation that could introduce loops is cross-directory rename. > > -Suppose after the operation there is a loop; since there hadn't been such > > -loops before the operation, at least on of the nodes in that loop must've > > -had its parent changed. In other words, the loop must be passing through > > -the source or, in case of exchange, possibly the target. > > - > > -Since the operation has succeeded, neither source nor target could have > > -been ancestors of each other. Therefore the chain of ancestors starting > > -in the parent of source could not have passed through the target and > > -vice versa. On the other hand, the chain of ancestors of any node could > > -not have passed through the node itself, or we would've had a loop before > > -the operation. But everything other than source and target has kept > > -the parent after the operation, so the operation does not change the > > -chains of ancestors of (ex-)parents of source and target. In particular, > > -those chains must end after a finite number of steps. > > - > > -Now consider the loop created by the operation. It passes through either > > -source or target; the next node in the loop would be the ex-parent of > > -target or source resp. After that the loop would follow the chain of > > -ancestors of that parent. But as we have just shown, that chain must > > -end after a finite number of steps, which means that it can't be a part > > -of any loop. Q.E.D. > > - > > While this locking scheme works for arbitrary DAGs, it relies on > > ability to check that directory is a descendent of another object. Current > > implementation assumes that directory graph is a tree. This assumption is > > -- > > 2.30.1 (Apple Git-130) > >