This document describes usage and proof of various hashed locks
introduced in this patch set.

Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
---
 .../filesystems/kernfs-hashed-locks.rst | 214 ++++++++++++++++++
 1 file changed, 214 insertions(+)
 create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst

diff --git a/Documentation/filesystems/kernfs-hashed-locks.rst b/Documentation/filesystems/kernfs-hashed-locks.rst
new file mode 100644
index 000000000000..a000e6fcf78c
--- /dev/null
+++ b/Documentation/filesystems/kernfs-hashed-locks.rst
@@ -0,0 +1,214 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===================
+kernfs hashed locks
+===================
+
+kernfs uses the following hashed locks:
+
+1. Hashed mutexes
+2. Hashed rwsems
+
+In certain cases a hashed rwsem needs to work in conjunction with a per-fs
+mutex (described further below), so this document describes that mutex as
+well.
+
+A kernfs_global_locks object (defined below) provides the hashed mutexes
+and hashed rwsems::
+
+	struct kernfs_global_locks {
+		struct mutex open_file_mutex[NR_KERNFS_LOCKS];
+		struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
+	};
+
+For all hashed locks the address of the kernfs_node object acts as the
+hashing key (a sketch of such a lookup appears at the end of the next
+section).
+
+For the remainder of this document a node means a kernfs_node object. The
+node can refer to a file, directory or symlink of a kernfs based file
+system. Also, a node's mutex or rwsem refers to the hashed mutex or hashed
+rwsem corresponding to the node. It does not mean any locking construct
+embedded in the kernfs_node itself.
+
+What is protected by hashed locks
+=================================
+
+(1) There's one kernfs_open_file for each open file, and all
+    kernfs_open_file instances corresponding to a kernfs_node are
+    maintained in a list. The hashed mutexes, i.e.
+    kernfs_global_locks.open_file_mutex[index], protect this list.
+
+(2) The hashed rwsems, i.e. kernfs_global_locks.kernfs_rwsem[index],
+    protect a node's state and synchronize operations that change the
+    state of a node or depend on it.
+
+(3) The per-fs mutex (mentioned earlier) provides synchronization between
+    lookup and remove operations. It also protects against topology
+    change.
+    While looking for a node we do not yet have the address of the
+    corresponding kernfs_node, so we can't acquire the node's rwsem right
+    from the beginning. On the other hand a parallel remove operation for
+    the same node can acquire the corresponding rwsem and go ahead with
+    node removal. So it may happen that a search operation finds and
+    returns the node, but before it can be pinned or used, the remove
+    operation that was going on in parallel removes the node and hence
+    makes any future use of it wrong.
+    The per-fs mutex ensures that of competing search and remove
+    operations only one proceeds at a time, and since the object returned
+    by the search is pinned before the per-fs mutex is released, it will
+    be available for subsequent use.
+
+    This per-fs mutex also protects against topology change during path
+    walks. During path walks we need to acquire and release the rwsems
+    corresponding to directories so that these directories don't move and
+    their children RB trees do not change. Since these rwsems can't be
+    taken under a spinlock, kernfs_rename_lock can't be used, and the
+    needed protection against topology change is provided by the per-fs
+    mutex.
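+To illustrate how the node address acts as the hashing key, the helpers
+below give a minimal sketch. They are not a verbatim copy of the patch
+set: the helper names, the kernfs_locks global and NR_KERNFS_LOCKS_BITS
+are assumptions made for this example, and hash_ptr() is simply one way
+of deriving an index from a pointer::
+
+	#include <linux/hash.h>
+	#include <linux/kernfs.h>
+
+	/* Hypothetical global holding the arrays of hashed locks. */
+	static struct kernfs_global_locks *kernfs_locks;
+
+	/*
+	 * Sketch only: map a kernfs_node's address to its hashed locks.
+	 * Assumes NR_KERNFS_LOCKS == (1 << NR_KERNFS_LOCKS_BITS) so that
+	 * hash_ptr() yields a valid array index.
+	 */
+	static struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
+	{
+		int idx = hash_ptr(kn, NR_KERNFS_LOCKS_BITS);
+
+		return &kernfs_locks->open_file_mutex[idx];
+	}
+
+	static struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+	{
+		int idx = hash_ptr(kn, NR_KERNFS_LOCKS_BITS);
+
+		return &kernfs_locks->kernfs_rwsem[idx];
+	}
+
+Because the index is derived from the node address, two distinct nodes
+may share a lock; this is harmless for correctness and only affects the
+degree of contention.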
+
+Lock usage and proof
+====================
+
+(1) Hashed mutexes
+
+    Since hashed mutexes protect the list of kernfs_open_file instances
+    corresponding to a kernfs_node, the ->open and ->release backends of
+    file_operations need to acquire the hashed mutex corresponding to the
+    kernfs_node. Also, when a kernfs_node is removed, all of its
+    kernfs_open_file instances are drained after deactivating the node.
+    This drain operation acquires the hashed mutex to traverse the list
+    of kernfs_open_file instances.
+    So addition (via ->open), deletion (via ->release) and traversal
+    (during kernfs_drain) of the kernfs_open_file list occur in a
+    serialized manner.
+
+(2) Hashed rwsems
+
+    2.1. A node's rwsem protects its state and needs to be acquired to:
+
+         2.1.a. Remove the node
+         2.1.b. Move the node
+         2.1.c. Traverse or modify a node's children RB tree (for
+                directories), i.e. to add/remove files/subdirectories
+                within/from a directory.
+         2.1.d. Modify or access a node's inode attributes
+
+    2.2. Hashed rwsems are used in the following operations:
+
+         2.2.a. Addition of a new node
+
+                While adding a new kernfs_node under a kernfs directory,
+                kernfs_add_one acquires the directory node's rwsem for
+                writing. Clause 2.1.a ensures that the directory exists
+                throughout the operation. Clause 2.1.c ensures proper
+                update of the children RB tree (i.e. ->dir.children).
+                Clause 2.1.d ensures correct modification of the inode
+                attributes to reflect the timestamp of this operation.
+                If the directory gets removed while waiting for the
+                semaphore, the subsequent checks in kernfs_add_one will
+                fail, resulting in an early bail-out from kernfs_add_one.
+
+         2.2.b. Removal of a node
+
+                Removal of a node involves recursive removal of all of
+                its descendants as well. The per-fs mutex (i.e.
+                kernfs_rm_mutex) avoids concurrent node removals even if
+                the nodes are different.
+
+                At first the node's rwsem is acquired. Clause 2.1.c
+                avoids parallel modification of the descendant tree, and
+                while holding this rwsem each of the descendants is
+                deactivated.
+
+                Once a descendant has been deactivated and drained, its
+                parent's rwsem is taken. Clause 2.1.c ensures proper
+                unlinking of this descendant from its siblings. Clause
+                2.1.d ensures that the parent's inode attributes are
+                correctly updated to record the timestamp of removal.
+
+         2.2.c. Movement of a node
+
+                Moving or renaming a node (kernfs_rename_ns) acquires the
+                rwsems of the node and of its old and new parents.
+                Clauses 2.1.b and 2.1.c avoid concurrent move operations
+                for the same node.
+                Also, if the old parent of a node changes while waiting
+                for the rwsems, the acquisition of the rwsems for the 3
+                involved nodes is attempted again. It is always ensured
+                that, as far as the old parent is concerned, the rwsem
+                corresponding to the current parent is acquired.
+
+         2.2.d. Reading a directory
+
+                For directory reading, kernfs_fop_readdir acquires the
+                directory node's rwsem for reading. Clause 2.1.c ensures
+                a consistent view of the children RB tree.
+                As far as the directory being read is concerned, if it
+                gets removed while waiting for the semaphore, the for
+                loop that iterates through the children will be
+                ineffective. So for this operation acquiring the
+                directory node's rwsem for reading is enough.
+
+         2.2.e. Dentry revalidation
+
+                A dentry revalidation (kernfs_dop_revalidate) can happen
+                for a negative or for a normal dentry.
+                For negative dentries we just need to check for a parent
+                change, so in this case acquiring the parent
+                kernfs_node's rwsem for reading is enough.
+                For a normal dentry acquiring the node's rwsem for
+                reading is enough (Clauses 2.1.a and 2.1.b).
+                If the node gets removed while waiting for the lock, the
+                subsequent checks in kernfs_dop_revalidate will fail and
+                kernfs_dop_revalidate will exit early.
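+
+                The following is a minimal sketch of this pattern, not
+                the actual kernfs code: kernfs_rwsem_ptr() is the
+                hypothetical helper from the earlier sketch, and the
+                active-count check stands in for the full set of
+                revalidation checks::
+
+                    /*
+                     * Sketch: decide whether a dentry's node is still
+                     * usable. A node that lost the race to a remover has
+                     * been deactivated (negative active count), so
+                     * revalidation fails and the dentry gets dropped.
+                     */
+                    static int kernfs_node_still_active(struct kernfs_node *kn)
+                    {
+                        struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+                        int active;
+
+                        down_read(rwsem);
+                        active = atomic_read(&kn->active) >= 0;
+                        up_read(rwsem);
+
+                        return active;
+                    }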
+
+         2.2.f. kernfs_node lookup
+
+                While searching for a node under a given parent
+                (kernfs_find_and_get_ns, kernfs_walk_and_get_ns) the
+                rwsem of the parent node is acquired for reading. Clause
+                2.1.c ensures a consistent view of the parent's children
+                RB tree. To avoid parallel removal of the found node
+                before it gets pinned, these operations make use of the
+                per-fs mutex (kernfs_rm_mutex) as explained earlier.
+                This per-fs mutex is also taken during kernfs_node
+                removal (__kernfs_remove).
+
+                If the node being searched for gets removed while
+                waiting for the mutex or rwsem, the subsequent
+                kernfs_find_ns or kernfs_walk_ns will fail.
+
+         2.2.g. kernfs_node's inode lookup
+
+                Looking up inode instances via kernfs_iop_lookup
+                involves node lookup, so the locks acquired are the same
+                as the ones required in 2.2.f. Also, once the node
+                lookup is complete, the parent's rwsem is released and
+                the rwsem of the found node is acquired to get the
+                corresponding inode. Since we are operating under the
+                per-fs kernfs_rm_mutex, the found node will not
+                disappear in the middle.
+
+         2.2.h. Updating or reading inode attributes
+
+                Interfaces that change inode attributes (i.e.
+                kernfs_setattr and kernfs_iop_setattr) acquire the
+                node's rwsem for writing. If the kernfs_node gets
+                removed while waiting for the semaphore, the subsequent
+                __kernfs_setattr will fail.
+                From 2.2.a and 2.2.b we know that updates due to
+                addition or removal of nodes will not happen in
+                parallel, so just locking the kernfs_node in these cases
+                is enough to guarantee correct modification of inode
+                attributes.
+                Similarly the interfaces that read inode attributes
+                (i.e. kernfs_iop_getattr, kernfs_iop_permission) just
+                need to acquire the involved node's rwsem for reading.
+
+         2.2.i. kernfs file event generation
+
+                kernfs_notify pins the involved node before scheduling
+                kernfs_notify_work, and kernfs_notify_workfn acquires
+                the node's rwsem. The clauses in 2.1 ensure a consistent
+                view of the node's state throughout the execution of the
+                work handler.
+
+         2.2.j. Mount
+
+                kernfs_fill_super, invoked during a mount operation,
+                acquires the root node's rwsem. During the mount process
+                there can't be other execution contexts trying to move
+                or delete the node, so just locking the involved node
+                (i.e. the root node) is enough.
+
+         2.2.k. Activating a node
+
+                For a node that started as deactivated, kernfs_activate
+                activates the node. In this case acquiring the node's
+                rwsem is enough. Since the node is not active yet, any
+                parallel removal that wins the race for the rwsem will
+                skip this node and its descendants. Also, user space
+                can't see a deactivated node, so we don't have any
+                parallel access emanating from there either.
+
+    2.3. For operations that involve locking multiple nodes at the same
+         time, locks are acquired in the order of their addresses, as
+         illustrated in the sketch below.
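+
+         The following is a minimal sketch of such an ordered
+         acquisition, reusing the hypothetical kernfs_rwsem_ptr() helper
+         from the earlier sketch; the helper name and the use of
+         down_write_nested() are assumptions made for illustration::
+
+             /*
+              * Sketch: write-lock the hashed rwsems of two nodes in
+              * address order so that concurrent multi-node lockers
+              * cannot deadlock against each other.
+              */
+             static void kernfs_down_write_double(struct kernfs_node *kn1,
+                                                  struct kernfs_node *kn2)
+             {
+                 struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+                 struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+                 if (rwsem1 == rwsem2) {
+                     /* Both nodes hash to the same rwsem; lock once. */
+                     down_write(rwsem1);
+                     return;
+                 }
+
+                 if (rwsem1 > rwsem2)
+                     swap(rwsem1, rwsem2);
+
+                 down_write(rwsem1);
+                 /* Same lock class taken twice; tell lockdep. */
+                 down_write_nested(rwsem2, SINGLE_DEPTH_NESTING);
+             }
-- 
2.30.2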