Hi Al, Just some editing suggestions below... On 2/25/24 14:06, Al Viro wrote: > The text below is a rough approximation to what should, IMO, > end up in Documentation/filesystems/rcu-exposure.rst. It started > as a part of audit notes. Intended target audience of the text is > filesystem maintainers and/or folks who are reviewing fs code; > I hope the current contents is already useful to an extent, but > I'm pretty certain that it needs more massage before it could > go into the tree. > > Please, read and review - comments and suggestions would be > very welcome, both for contents and for markup - I'm *not* familiar > with ReST or the ways it's used in the kernel (in particular, footnotes > seem to get lost in PDF). > > > =================================================================== > The ways in which RCU pathwalk can ruin filesystem maintainer's day > =================================================================== > > The problem: exposure of filesystem code to lockless environment > ================================================================ > > Filesystem methods can usually count upon VFS-provided warranties > regarding the stability of objects they are called to act upon; at the > very least, they can expect the dentries/inodes/superblocks involved to > remain live throughout the operation. > > Life would be much more painful without that; however, such warranties > do not come for free. The problem is that access patterns are heavily > biased; every system call getting an absolute pathname will have to > start at root directory, etc. Having each of them in effect write > "I'd been here" on the same memory objects would cost quite a bit. > As the result, we try to keep the fast path stores-free, bumping > no refcounts and taking no locks. Details are described elsewhere, > but the bottom line for filesystems is that some methods may be called > with much looser warranties than usual. Of course, from the filesystem > POV each of those is a potential source of headache - you are asked to > operate on an object that might start to be torn down right under you, > possibly along with the filesystem instance it lives on. > > The last possibility might sound surprising, but a lazy pathwalk really > can run into a dentry on a filesystem that is getting shut down. > > Here's one of the scenarios for that to happen: > > 1. have two threads sharing fs_struct chdir'ed on that filesystem. > 2. lazy-umount it, so that the only thing holding it alive is > cwd of these threads. > 3. the first thread does relative pathname resolution > and gets to e.g. ->d_hash(). It's holding rcu_read_lock(). > 4. at the same time the second thread does fchdir(), moving to > different directory. > > In fchdir(2) we get to set_fs_pwd(), which set the current directory > to the new place and does mntput() on the old one. No RCU delays here, > we calculate the refcount of that mount and see that we are dropping > the last reference. We make sure that the pathwalk in progress in > the first thread will fail when it comes to legitimize_mnt() and do this > (in mntput_no_expire()):: > > init_task_work(&mnt->mnt_rcu, __cleanup_mnt); > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) > return; > > As we leave the syscall, we have __cleanup_mnt() run; it calls cleanup_mnt() > on our mount, which hits deactivate_super(). That was the last reference to > superblock. > > Voila - we have a filesystem shutdown right under the nose of a thread > running in ->d_hash() of something on that filesystem. Mutatis mutandis, > one can arrange the same for other methods called by rcu pathwalk. > > It's not easy to hit (especially if you want to get through the > entire ->kill_sb() before the first thread gets through ->d_hash()), > and it's probably impossible on the real hardware; on KVM it might be > borderline doable. However, it is possible and I would not swear that > other ways of arranging the same thing are equally hard to hit. > > The bottom line: methods that can be called in RCU mode need to > be careful about the per-superblock objects destruction. > > > Opting out > ========== > > To large extent a filesystem can opt out of RCU pathwalk; that loses all > scalability benefits whenever you filesystem gets involved in pathname your > resolution, though. If that's the way you choose to go, just make sure > that > > 1. any ->d_revalidate(), ->permission(), ->get_link() and ->get_inode_acl() > instance bails out if called by RCU pathwalk (see below for details). > Costs a couple of lines of boilerplate in each. > > 2. if some symlink inodes have ->i_link set to a dynamically allocated > object, that object won't be freed without an RCU delay. Anything > coallocated with inode is fine, so's anything freed from ->free_inode(). > Usually comes for free, just remember to avoid freeing directly > from ->destroy_inode(). > > 3. any ->d_hash() and ->d_compare() instances (if you have those) do > not access any filesystem objects. > > 4. there's no ->d_automount() instances in your filesystem. > > If your case does not fit the above, the easy opt-out is not for you. > If so, you'll have to keep reading... > > > What methods are affected? > ========================== > > The list of the methods that could run into that fun: > > ======================== ================================== ================= > method indication that the call is unsafe unstable objects > ======================== ================================== ================= > ->d_hash(d, ...) none - any call might be d > ->d_compare(d, ...) none - any call might be d > ->d_revalidate(d, f) f & LOOKUP_RCU d > ->d_manage(d, f) f d > ->permission(i, m) m & MAY_NOT_BLOCK i > ->get_link(d, i, ...) d == NULL i > ->get_inode_acl(i, t, f) f == LOOKUP_RCU i > ======================== ================================== ================= > > > Additionally, callback set by set_delayed_call() from unsafe call of > ->get_link() will be run in the same environment; that one is usually not > a problem, though. > > For the sake of completeness, three of LSM methods > (->inode_permission(), ->inode_follow_link() and ->task_to_inode()) > might be called in similar environment, but that's a problem for LSM > crowd, not for filesystem folks. > > > Any method call is, of course, required not to crash - no stepping on > freed memory, etc. All of the unsafe calls listed above are done under > rcu_read_lock(), so they are not allowed to block. Further requirements > vary between the methods. > > > Before going through the list of affected methods, several notes on > the things that _are_ guaranteed: > > * if a reference to struct dentry is passed to such call, it will > not be freed until the method returns. The same goes for a reference to > struct inode and to struct super_block pointed to by ->d_sb or ->i_sb > members of dentry and inode resp. Any of those might be in process of please spell out respectively. > being torn down or enter such state right under us; the entire point of > those unsafe calls is that we make them without telling anyone they'd > need to wait for us. > > * following ->d_parent and ->d_inode of such dentries is fine, > provided that it's done by READ_ONCE() (for ->d_inode the preferred > form is d_inode_rcu(dentry)). The value of ->d_parent is never going > to be NULL and it will again point to a struct dentry that will not be > freed until the method call finishes. The value of ->d_inode might > be NULL; if non-NULL, it'll be pointing to a struct inode that will > not be freed until the method call finishes. > > * none of the inodes passed to an unsafe call could have reached > fs/inode.c:evict() before the caller grabbed rcu_read_lock(). > > * for inodes 'not freed' means 'not entered ->free_inode()', so > anything that won't be destroyed until ->free_inode() is safe to access. > Anything synchronously destroyed in ->evict_inode() or ->destroy_inode() > is not safe; however, one can count upon the call_rcu() callbacks > issued in those yet to be entered. Note that unlike dentries and > superblocks, inodes are embedded into filesystem-private objects; > anything stored directly in the containing object is safe to access. > > * for dentries anything destroyed by ->d_prune() (synchronously or > not) is not safe; the same goes for the things synchronously destroyed > by ->d_release(). However, call_rcu() callbacks issued in ->d_release() > are yet to be entered. > > * for superblocks we can count upon call_rcu() callbacks issued > from inside the ->kill_sb() (including the ones issued from > ->put_super()) yet to be entered. You can also count upon > ->s_user_ns still being pinned and ->s_security still not > freed. > > * NOTE: we **can not** count upon the things like ->d_parent > being positive (or a directory); a race with rename()+rmdir()+mknod() > and you might find a FIFO as parent's inode. NULL is even easier - > just have the dentry and its ex-parent already past dentry_kill() > (which is a normal situation for eviction on memory pressure) and there > you go. Normally such pathologies are prevented by the locking (and > dentry refcounting), but... the entire point of that stuff is to avoid > informing anyone that we are there, so those mechanisms are bypassed. > What's more, if dentry is not pinned by refcount, grabbing its ->d_lock > will *not* suffice to prevent that kind of mess - the scenario with > eviction by memory pressure won't be prevented by that; you might have > grabbed ->d_lock only after the dentry_kill() had released it, and > at that point ->d_parent still points to what used to be the parent, > but there's nothing to prevent its eviction. > > > ->d_compare() > ------------- > > For ->d_compare() we just need to make sure it won't crash > when called for dying dentry - an incorrect return value won't harm the > caller in such case. False positives and false negatives alike - the > callers take care of that. To be pedantic, make that "false positives > do not cause problems unless they have ->d_manage()", but ->d_manage() > is present only on autofs and there's no autofs ->d_compare() instances. > [#f1]_ > > There is no indication that ->d_compare() is called in RCU mode; > the majority of callers are such, anyway, so we need to cope with that. > VFS guarantees that dentry won't be freed under us; the same goes for > the superblock pointed to by its ->d_sb. Name points to memory object > that won't get freed under us and length does not exceed the size of > that object. The contents of that object is *NOT* guaranteed to be > stable; d_move() might race with us, modifying the name. However, in > that case we are free to return an arbitrary result - the callers will > take care of both false positives and false negatives in such case. > The name we are comparing dentry with (passed in qstr) is stable, > thankfully... > > If we need to access any other data, it's up to the filesystem > to protect it. In practice it means that destruction of fs-private part > of superblock (and possibly unicode tables hanging off it, etc.) might > need to be RCU-delayed. > > *IF* you want the behaviour that varies depending upon the parent > directory, you get to be very careful with READ_ONCE() and watch out > for the object lifetimes. > > Basically, if the things get that tricky, ask for help. > Currently there are two such instances in the tree - proc_sys_compare() > and generic_ci_d_compare(). Both are... special. > > > ->d_hash() > ---------- > > For ->d_hash() on a dying dentry we are free to report any hash > value; the only extra requirement is that we should not return stray > hard errors. In other words, if we return anything other than 0 or > -ECHILD, we'd better make sure that this error would've been correct > before the parent started dying. Since ->d_hash() error-reporting is no hyphen IMO ^ > usually done to reject unacceptable names (too long, contain unsuitable > characters for this filesystem, etc.), that's really not a problem - > hard errors depend only upon the name, not the parent. > > Again, VFS guarantees that freeing of dentry and of the superblock > pointed to by dentry->d_sb won't happen under us. The name passed to > us (in qstr) is stable. If you need anything beyond that, you are > in the same situation as with ->d_compare(). Might want to RCU-delay > freeing private part of superblock (if that's what we need to access), > might want the same for some objects hanging off that (unicode tables, > etc.). If you need something beyond that - ask for help. > > > ->d_revalidate() > ---------------- > > For this one we do have an indication of call being unsafe - > flags & LOOKUP_RCU. With ->d_revalidate we are always allowed to bail > out and return -ECHILD; that will have the caller drop out of RCU mode. > We definitely need to do that if revalidate would require any kind of IO, > mutex-taking, etc.; we can't block in RCU mode. > > Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU > in flags as "return -ECHILD and be done with that"; it's guaranteed to > do the right thing, but you lose the benefits of RCU pathwalks whenever > you run into such dentry. > > Same as with the previous methods, we are guaranteed that > dentry and dentry->d_sb won't be freed under us. We are also guaranteed > that ->d_parent (which is *not* stable, so use READ_ONCE) points to a > struct dentry that won't get freed under us. As always with ->d_parent, > it's not NULL - for a detached dentry it will point to dentry itself. > d_inode_rcu() of dentry and its parent will be either NULL or will > point to a struct inode that won't get freed under us. Anything beyond s/tab/space/ > than that is not guaranteed. We may find parent to be negative - it can > happen if we race with d_move() and removal of old parent. In that case > just return -ECHILD and be done with that. > > On non-RCU side you could use dget_parent() instead - that > would give a positive dentry and its ->d_inode would remain stable. > dget_parent() has to be paired with dput(), though, so it's not usable > in RCU mode. > > If you need fs-private objects associated with dentry, its parent > inode(s) or superblock - see the general notes above on how to access > those. > > > ->d_manage() > ------------ > > Can be called in RCU mode; gets an argument telling it if it has > been called so. Pretty much autofs-only; for everyone's sanity sake, > don't inflict more of those on the kernel. Definitely don't do that > without asking first... > > > ->permission() > -------------- > > Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK > in mask, and it can only happen for MAY_EXEC checks on directories. > In RCU mode it is not allowed to block, and it is allowed to bail out > by returning -ECHILD. It might be called for an inode that is getting > torn down, possibly along with its filesystem. Errors other than -ECHILD s/tab/space/ > should only be returned if they would've been returned in non-RCU mode; > several instances in procfs currently (6.5) run afoul of that one. That's > an instructive example, BTW - what happens is that proc_pid_permission() > uses proc_get_task() to find the relevant process. proc_get_task() > uses PID reference stored in struct proc_inode our inode is embedded > into; inode can't have been freed yet, so fetching ->pid member in that > is safe. However, using the value you've fetched is a different story > - proc_evict_inode() would have passed it to put_pid() and replaced > it with NULL. Unsafe caller has no way to tell if that is happening > right under it. Solution: stop zeroing ->pid in proc_evict_inode() > and move put_pid() from proc_pid_evict_inode() to proc_free_inode(). > That's not all that is needed (there's access to procfs-private part of > superblock as well), but it does make a good example of how such stuff > can be dealt with. > > Note that idmap argument is safe on all calls - its destruction > is rcu-delayed. > > The amount of headache is seriously reduced (for now) by the fact > that a lot of instances boil down to generic_permission() (which will > do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK. > If we ever extend RCU mode to other ->permission() callers, the thing will > get interesting; that's not likely to happen, though, unless access(2) > goes there [this is NOT a suggestion, folks]. > > > ->get_link() > ------------ > > Again, this can be called in RCU mode. Even if your ->d_revalidate() s/tab/space/ > always returns -ECHILD in RCU mode and kicks the pathwalk out of it, > you can't assume that ->get_link() won't be reached [#f2]_. > NULL dentry argument is an indicator of unsafe call; if you can't handle > it, just return ERR_PTR(-ECHILD). Any allocations you need to do (and > with this method you really might need that) should be done with GFP_ATOMIC > in the unsafe case. > > Whatever you pass to set_delayed_call() is going to be called > in the same mode as ->get_link() itself; not a problem for most of the > instances. The string you return needs to stay there until the > callback gets called or, if no callback is set, until at least the > freeing of inode. As usual, for an unsafe call the inode might be > in process of teardown, possibly along with the hosting filesystem. > The usual considerations apply. The same, BTW, applies to whatever > you set in ->i_link - it must stay around at least until ->free_inode(). > > > ->get_inode_acl() > ----------------- > > Very limited exposure for that one - unsafe call is possible > only if you explicitly set ACL_DONT_CACHE as cached ACL value. > Only two filesystems (fuse and overlayfs) even bother. Unsafe call > is indicated by explicit flag (the third argument of the method), > bailout is done by returning ERR_PTR(-CHILD) and the usual considerations > apply for any access to data structures you might need to do. > > .. rubric:: Footnotes > > .. [#f1] > > Some callers prevent being called for dying dentry (holding ->d_lock and > having verified !d_unhashed() or finding it in the list of inode's aliases > under ->i_lock). For those the scenario in question simply cannot arise. > > Some follow the match with lockref_get_not_dead() and treat the failure > as mismatch. That takes care of false positives, and false negatives on > dying dentry are still correct - we simply pretend to have lost the race. > > The only caller that does not fit into the classes above is > __d_lookup_rcu_op_compare(). There we sample ->d_seq and verify > !d_unhashed() before calling ->d_compare(). That is not enough to > prevent dentry from starting to die right under us; however, the sampled > value of ->d_seq will be rechecked when the caller gets to step_into(), > so for a false positive we will end up with a mismatch. The corner case > around ->d_manage() is due to the handle_mounts() done before step_into() > gets to ->d_seq validation... > > .. [#f2] > > binding a symlink on top of a regular file on another filesystem is possible > and that's all it takes for RCU pathwalk to get there. > Thanks for the documentation. -- #Randy