> Instead of proof by contradiction, this proof more intuitively explains > the order that the locking rules want to enforce. If the developers want > to update the locking rules, they would be forced to come up with the > corresponding order that justifies deadlock-freedom, otherwise, the > proof won't pass. As a evidence of effectiveness, the authors of RefFS > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved > directories") through this formal approach. Any comments on this proof? BTW, it would be okay to keep both the by-contradiction proof and by-order proof as both proofs may find their audience. Mo Zou <lostzoumo@xxxxxxxxx> 于2024年4月13日周六 00:10写道: > > The directory locking proof should and could be made more formal. > Specifically, the locking rules for rename are very intricate and > subtle. The current proof shows deadlock-freedom by contradiction, i.e., > if a deadlock is possible, where could go wrong. There is no denying > that the proof is detailed with all possible cases considered. However, > it gives less intuition on why the locking rules are correct and may be > hard to maintain. Developers and even experts may still make mistakes > without realizing that the proof is no longer correct. See commit > 28eceeda130f ("fs: Lock moved directories") for such a case. > > A recent academia paper RefFS from OSDI 2024 has proposed a formal way > to prove the correctness of locking scheme with machine-checkable proof > (see https://ipads.se.sjtu.edu.cn/projects/reffs for a preprint of the > paper). The core idea is to formally define the locking order of the > rules. Actually, the existing proof has the same idea but fails to > define the internal ranking for directories. The difficulty is that the > ranking for directories is not static but dynamic. In this patch, we > follow the idea from the paper to dynamically define the internal > ranking of directories and shows that operations always take directories > in order of increasing rank so that deadlocks are never possible. > Instead of proof by contradiction, this proof more intuitively explains > the order that the locking rules want to enforce. If the developers want > to update the locking rules, they would be forced to come up with the > corresponding order that justifies deadlock-freedom, otherwise, the > proof won't pass. As a evidence of effectiveness, the authors of RefFS > paper firstly report the bug from commit 28eceeda130f ("fs: Lock moved > directories") through this formal approach. > > Signed-off-by: Mo Zou <lostzoumo@xxxxxxxxx> > --- > .../filesystems/directory-locking.rst | 224 +++++++++--------- > 1 file changed, 113 insertions(+), 111 deletions(-) > > diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst > index 05ea387bc9fb..94e358d71986 100644 > --- a/Documentation/filesystems/directory-locking.rst > +++ b/Documentation/filesystems/directory-locking.rst > @@ -44,7 +44,7 @@ For our purposes all operations fall in 6 classes: > * decide which of the source and target need to be locked. > The source needs to be locked if it's a non-directory, target - if it's > a non-directory or about to be removed. > - * take the locks that need to be taken (exlusive), in inode pointer order > + * take the locks that need to be taken (exclusive), in inode pointer order > if need to take both (that can happen only when both source and target > are non-directories - the source because it wouldn't need to be locked > otherwise and the target because mixing directory and non-directory is > @@ -135,12 +135,13 @@ If no directory is its own ancestor, the scheme above is deadlock-free. > Proof: > > There is a ranking on the locks, such that all primitives take > -them in order of non-decreasing rank. Namely, > +them in order of increasing rank. Namely, > > * rank ->i_rwsem of non-directories on given filesystem in inode pointer > order. > - * put ->i_rwsem of all directories on a filesystem at the same rank, > - lower than ->i_rwsem of any non-directory on the same filesystem. > + * put ->i_rwsem of all directories on a filesystem lower than ->i_rwsem > + of any non-directory on the same filesystem and the internal ranking > + of directories are defined later. > * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem > on the same filesystem. > * among the locks on different filesystems use the relative > @@ -149,92 +150,120 @@ them in order of non-decreasing rank. Namely, > For example, if we have NFS filesystem caching on a local one, we have > > 1. ->s_vfs_rename_mutex of NFS filesystem > - 2. ->i_rwsem of directories on that NFS filesystem, same rank for all > + 2. ->i_rwsem of directories on that NFS filesystem, internal ranking > + defined later > 3. ->i_rwsem of non-directories on that filesystem, in order of > increasing address of inode > 4. ->s_vfs_rename_mutex of local filesystem > - 5. ->i_rwsem of directories on the local filesystem, same rank for all > + 5. ->i_rwsem of directories on the local filesystem, internal ranking > + defined later > 6. ->i_rwsem of non-directories on local filesystem, in order of > increasing address of inode. > > -It's easy to verify that operations never take a lock with rank > -lower than that of an already held lock. > - > -Suppose deadlocks are possible. Consider the minimal deadlocked > -set of threads. It is a cycle of several threads, each blocked on a lock > -held by the next thread in the cycle. > - > -Since the locking order is consistent with the ranking, all > -contended locks in the minimal deadlock will be of the same rank, > -i.e. they all will be ->i_rwsem of directories on the same filesystem. > -Moreover, without loss of generality we can assume that all operations > -are done directly to that filesystem and none of them has actually > -reached the method call. > - > -In other words, we have a cycle of threads, T1,..., Tn, > -and the same number of directories (D1,...,Dn) such that > - > - T1 is blocked on D1 which is held by T2 > - > - T2 is blocked on D2 which is held by T3 > - > - ... > - > - Tn is blocked on Dn which is held by T1. > - > -Each operation in the minimal cycle must have locked at least > -one directory and blocked on attempt to lock another. That leaves > -only 3 possible operations: directory removal (locks parent, then > -child), same-directory rename killing a subdirectory (ditto) and > -cross-directory rename of some sort. > - > -There must be a cross-directory rename in the set; indeed, > -if all operations had been of the "lock parent, then child" sort > -we would have Dn a parent of D1, which is a parent of D2, which is > -a parent of D3, ..., which is a parent of Dn. Relationships couldn't > -have changed since the moment directory locks had been acquired, > -so they would all hold simultaneously at the deadlock time and > -we would have a loop. > - > -Since all operations are on the same filesystem, there can't be > -more than one cross-directory rename among them. Without loss of > -generality we can assume that T1 is the one doing a cross-directory > -rename and everything else is of the "lock parent, then child" sort. > - > -In other words, we have a cross-directory rename that locked > -Dn and blocked on attempt to lock D1, which is a parent of D2, which is > -a parent of D3, ..., which is a parent of Dn. Relationships between > -D1,...,Dn all hold simultaneously at the deadlock time. Moreover, > -cross-directory rename does not get to locking any directories until it > -has acquired filesystem lock and verified that directories involved have > -a common ancestor, which guarantees that ancestry relationships between > -all of them had been stable. > - > -Consider the order in which directories are locked by the > -cross-directory rename; parents first, then possibly their children. > -Dn and D1 would have to be among those, with Dn locked before D1. > -Which pair could it be? > - > -It can't be the parents - indeed, since D1 is an ancestor of Dn, > -it would be the first parent to be locked. Therefore at least one of the > -children must be involved and thus neither of them could be a descendent > -of another - otherwise the operation would not have progressed past > -locking the parents. > - > -It can't be a parent and its child; otherwise we would've had > -a loop, since the parents are locked before the children, so the parent > -would have to be a descendent of its child. > - > -It can't be a parent and a child of another parent either. > -Otherwise the child of the parent in question would've been a descendent > -of another child. > - > -That leaves only one possibility - namely, both Dn and D1 are > -among the children, in some order. But that is also impossible, since > -neither of the children is a descendent of another. > - > -That concludes the proof, since the set of operations with the > -properties requiered for a minimal deadlock can not exist. > +.. _RefFS: https://ipads.se.sjtu.edu.cn/projects/reffs > + > +The ranking above is mostly static except the internal ranking of > +directories. Below we will focus on internal ranking of directories on > +the same filesystem and present a way to dynamically define ranks of > +directories such that operations still take locks in order of > +increasing rank, i.e., operations always take a lock with rank higher > +than that of any already held locks. Note that the idea of dynamic > +ranking has a formal basis and more details can be found in the > +academia paper RefFS_ from OSDI 2024. > + > + > +First, without considering cross-directory renames, a directory's rank > +is defined as its distance from the root, i.e., root directory has rank > +0 and each child directory's rank equals one plus its parent > +directory's rank. Because no directory is its own ancestor, every > +directory's rank can be uniquely determined starting from root to its > +descendants. This ranking accounts for the (locks parent, then child) > +nested locking in directory removal and same-directory rename. > + > +Then, consider the most complex case in a cross-directory rename, where the > +most locks need to be acquired. The rename may perform the following > +nested locking (only show lock statements for convenience). Here, dir1 > +and dir2 are the two parents and according to the locking rules, we > +know dir2 is not ancestor to dir1. child1 and child2 are source and > +target in a order according to the rules above and may not necessarily > +correspond to the child of dir1 and dir2. Ignore things in angle > +brackets for now. The locking order here can not be predetermined. For > +instance, rename(/a/x,/b/y) and rename(/b/y,/a/x) would acquire /a and > +/b in different order because neither parent is an ancestor of the > +other and we lock the parent of source first. But the observation is > +that after acquiring the filesystem lock, the precise locking order > +becomes known. We introduce a ghost state edge to help define the > +ranks (ghost state is a commonly used technique in verification which > +has no influence on the program execution and its only role is to > +assist the proof). > + > + 1. lock filesystem (->s_vfs_rename_mutex); > + 2. lock dir1->i_rwsem; <set edge to (dir1, dir2)> > + 3. lock dir2->i_rwsem; <set edge to (non-parent, child1) if child1 is > + a directory and non-parent is the one in dir1 and dir2 that is not > + parent to child1, otherwise, set edge to None> > + 4. lock child1->i_rwsem; <set edge to (child1, child2) if chil1 and > + child2 are directories, otherwise, set edge to None> > + 5. lock child2->i_rwsem; <set edge to None> > + > +The value of edge is either None or (inode1,inode2) > +representing a possible lock dependency from inode1 to inode2, i.e., > +some thread may request lock of inode2 with lock of inode1 held. There > +is only ever one edge value in a filesystem with edge set to None when > +the s_vfs_rename_mutex is not held and edge set to a determined value > +when s_vfs_rename_mutex is held. When edge is None, the ranking rules > +are the same as before (root is 0 and child is one plus its parent). > +When edge is (inode1, inode2), only the ranking rule for inode2 > +changes: inode2's rank is one plus the larger rank of inode2's parent > +and inode1. Intuitively, a directory's rank is the longest distance > +from root in the filesystem tree extended with edge. > + > +Now let's check the setting of edge in angle brackets in above code to > +see how it ensures cross-directory rename takes locks in increasing > +order in this case. After the lock statement at line 2, edge is set to > +(dir1,dir2). Because dir2 is not ancestor to dir1, the rank of dir1 > +stays the same and the rank of dir2 (as well as every other > +directories) can be uniquely determined. So at line 3, dir2's rank is > +higher than dir1's. Before the code acquires child1 at line 4, it > +already holds dir1 and dir2. We know one of dir1 and dir2 is parent to > +child1, so we update edge to (the non-parent of child1, child1). > +Because child1 cannot be ancestor to dir1 or dir2, every directory's > +rank is still uniquely defined (for the same reason as above). So at > +line 4, we ensure child1 has higher rank than dir1 and dir2. Similarly, > +we update edge to (child1, child2) so that at line 5, child2 has higher > +rank than held locks (dir1, dir2 and child1). We reset edge to None > +whenever we have acquired all directory locks. > + > +In a less complex case of cross-directory rename, the code may execute > +a prefix of above lines and we always reset edge to None after having > +acquired all directory locks. > + > +The ranking of directories may change under following cases: > + > + * directory removal/creation: a directory's rank is removed/added > + * rename: the effect of rename is either (1) moving a subtree under a > + node that is outside of the subtree (the directories in the subtree are > + all recalculated based on the node whose rank is not influenced) > + and potentially removing target (a directory's rank is removed) or > + (2) in case of RENAME_EXCHANGE, two subtrees exchange their > + positions. Because source is not a descendent of the target and > + target is not a descendent of source, the parents of the two > + subtrees cannot not be in these subtrees so the ranks of parents > + are not influenced. The two subtrees' ranks are recalculated based > + on their parents > + * edge updates in a cross-directory rename: setting edge to > + (inode1,inode2) or resetting edge to None modify the ranks of the > + inode2 subtree > + > +Initially in a filesystem, no directory is its own ancestor and > +directories' ranks can be uniquely determined. All these changes > +preserve the fact that at every time, the ranking of every directory > +can be uniquely determined. Thus no directory is its own ancestor > +(because if a directory is its own ancestor, its rank cannot be > +determined) and we prove the premise. Plus the fact that operations > +always take locks in order of increasing rank (can be easily verified). > +We can conclude the locking rules above are deadlock-free. > + > > Note that the check for having a common ancestor in cross-directory > rename is crucial - without it a deadlock would be possible. Indeed, > @@ -247,33 +276,6 @@ distant ancestor. Add a bunch of rmdir() attempts on all directories > in between (all of those would fail with -ENOTEMPTY, had they ever gotten > the locks) and voila - we have a deadlock. > > -Loop avoidance > -============== > - > -These operations are guaranteed to avoid loop creation. Indeed, > -the only operation that could introduce loops is cross-directory rename. > -Suppose after the operation there is a loop; since there hadn't been such > -loops before the operation, at least on of the nodes in that loop must've > -had its parent changed. In other words, the loop must be passing through > -the source or, in case of exchange, possibly the target. > - > -Since the operation has succeeded, neither source nor target could have > -been ancestors of each other. Therefore the chain of ancestors starting > -in the parent of source could not have passed through the target and > -vice versa. On the other hand, the chain of ancestors of any node could > -not have passed through the node itself, or we would've had a loop before > -the operation. But everything other than source and target has kept > -the parent after the operation, so the operation does not change the > -chains of ancestors of (ex-)parents of source and target. In particular, > -those chains must end after a finite number of steps. > - > -Now consider the loop created by the operation. It passes through either > -source or target; the next node in the loop would be the ex-parent of > -target or source resp. After that the loop would follow the chain of > -ancestors of that parent. But as we have just shown, that chain must > -end after a finite number of steps, which means that it can't be a part > -of any loop. Q.E.D. > - > While this locking scheme works for arbitrary DAGs, it relies on > ability to check that directory is a descendent of another object. Current > implementation assumes that directory graph is a tree. This assumption is > -- > 2.30.1 (Apple Git-130) >