Extended since the last repost.  The branch is in #work.dotdot, still
based at 5.6-rc1.  Diffstat is

 Documentation/filesystems/path-lookup.rst |    7 +-
 fs/autofs/dev-ioctl.c                     |    6 +-
 fs/internal.h                             |    1 -
 fs/namei.c                                | 1465 ++++++++++++-----------------
 fs/namespace.c                            |   96 +-
 fs/open.c                                 |    4 +-
 include/linux/namei.h                     |    4 +-
 7 files changed, 675 insertions(+), 908 deletions(-)

Individual patches are in the followups.  The branch survives local
testing (including ltp and xfstests).  Review and testing would be
_very_ welcome; it does a lot of massage, so there have been plenty of
opportunities to fuck up and fail to spot that.  The same goes for
profiling - it doesn't seem to slow things down, but that needs to be
verified.

If nobody screams, into -next it goes in a few days...

Changes since v3: ".." series (part 6) got cleaned up, a bunch of
pick_link() cleanups added (part 7) and more path_openat()/do_last()/
open_last_lookups() refactoring added at the end (part 8).

part 1: follow_automount() cleanups and fixes.

Quite a bit of that function had been about working around the wrong
calling conventions of finish_automount().  The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to
discard whatever it was going to attach and just cross into what got
mounted there in the meanwhile (most likely - the results of the same
automount triggered by somebody else).  Current mainline kinda-sorta
manages to do that, but it's unreliable and very convoluted.  A much
simpler approach is to stop using lock_mount() in finish_automount()
and have it bail out if something turns out to have been mounted on top
of where we wanted to attach.  That allows us to get rid of a lot of
PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.

Another thing in the same series is divorcing follow_automount() from
nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().

4 commits, the second one fixes a hard-to-hit race.  The first is a
prereq for it.

1/69 do_add_mount(): lift lock_mount/unlock_mount into callers
2/69 fix automount/automount race properly
3/69 follow_automount(): get rid of dead^Wstillborn code
4/69 follow_automount() doesn't need the entire nameidata

part 2: unifying mount traversals in pathwalk.

Handling of mount traversal (follow_managed()) is currently called in
a bunch of places.  Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof.  However, the locations
of those step_into() calls are far from the preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to a given step_into() call.  It's harder to analyse than it
should be (especially when it comes to liveness analysis) and it forces
rather ugly calling conventions on lookup_fast()/atomic_open()/
lookup_open().  The series below massages the code to the point where
the calls of follow_managed() (and __follow_mount_rcu()) move into the
beginning of step_into().

5/69 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
    gets EEXIST handling in do_last() past the step_into() call there.
6/69 handle_mounts(): start building a sane wrapper for follow_managed()
    rather than mangling follow_managed() itself (and creating conflicts
    with openat2 series), add a wrapper that will absorb the required
    interface changes.
7/69 atomic_open(): saner calling conventions (return dentry on success)
    struct path passed to it is pure out parameter; only dentry part
    ever varies, though - mnt is always nd->path.mnt.
    Just return the dentry on success, and ERR_PTR(-E...) on failure.
8/69 lookup_open(): saner calling conventions (return dentry on success)
    propagate the same change one level up the call chain.
9/69 do_last(): collapse the call of path_to_nameidata()
    struct path filled in lookup_open() call is eventually given to
    handle_mounts(); the only use it has before that is
    path_to_nameidata() call in "->atomic_open() has actually opened it"
    case, and there path_to_nameidata() is an overkill - we are
    guaranteed to replace only nd->path.dentry.  So have the struct path
    filled only immediately prior to handle_mounts().
10/69 handle_mounts(): pass dentry in, turn path into a pure out argument
    now all callers of handle_mounts() are directly preceded by filling
    the struct path it gets.  path->mnt is nd->path.mnt in all cases, so
    we can pass just the dentry instead and fill path in handle_mounts()
    itself.  Some boilerplate gone, path is pure out argument of
    handle_mounts() now.
11/69 lookup_fast(): consolidate the RCU success case
    massage to gather what will become an RCU case equivalent of
    handle_mounts(); basically, that's what we do if revalidate succeeds
    in RCU case of lookup_fast(), including unlazy and fallback to
    handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/69 teach handle_mounts() to handle RCU mode
    ... and take that into handle_mounts() itself.  The other caller of
    __follow_mount_rcu() is fine with the same fallback (it just didn't
    bother since it's in the very beginning of pathwalk), switched to
    handle_mounts() as well.
13/69 lookup_fast(): take mount traversal into callers
    Now we are getting somewhere - both RCU and non-RCU success cases of
    lookup_fast() end with the same
        return handle_mounts(...);
    move that to the callers - there it will merge with the identical
    calls that had been on the paths where we had to do slow lookups.
    lookup_fast() returns dentry now.
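
The pointer-returning convention adopted in the patches above (dentry on
success, ERR_PTR(-E...) on failure and, for lookup_fast(), NULL for a
cache miss) can be illustrated in userspace.  What follows is only a
minimal sketch, not the fs/namei.c code: ERR_PTR()/PTR_ERR()/IS_ERR()
are reimplemented here as stand-ins for the kernel helpers of the same
names, while toy_dentry, cache[], toy_lookup_fast() and classify() are
hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error)
{
	/* only small negative errno values may be encoded this way */
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)error;
}

static long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static int IS_ERR(const void *ptr)
{
	/* the top MAX_ERRNO addresses are reserved for error codes */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy "dentry": nothing but a name. */
struct toy_dentry {
	const char *name;
};

/* Hypothetical toy "dcache" with two entries. */
static struct toy_dentry cache[] = { { "dev" }, { "proc" } };

/*
 * Toy analogue of the new lookup_fast() convention:
 *   pointer to dentry - cache hit
 *   NULL              - cache miss, caller goes for the slow path
 *   ERR_PTR(-E...)    - hard failure
 */
static struct toy_dentry *toy_lookup_fast(const char *name)
{
	size_t i;

	if (strlen(name) > 255)
		return ERR_PTR(-ENAMETOOLONG);
	for (i = 0; i < sizeof(cache) / sizeof(cache[0]); i++)
		if (strcmp(cache[i].name, name) == 0)
			return &cache[i];
	return NULL;
}

/* What a caller sees: one return value, three distinguishable outcomes. */
static const char *classify(const struct toy_dentry *dentry)
{
	if (IS_ERR(dentry))
		return "error";
	return dentry ? "hit" : "miss";
}
```

A caller checks IS_ERR() first, then NULL, and only then touches the
dentry; there is no separate out parameter to keep in sync with an int
return value, which is the point of the change.
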
14/69 step_into() callers: dismiss the symlink earlier
    dismiss the symlink being traversed as soon as we know we are done
    looking at its body; do that directly from step_into() callers, don't
    leave it for step_into() to do.
15/69 new step_into() flag: WALK_NOFOLLOW
    use step_into() instead of open-coding it in handle_lookup_down().
    Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW"
    for that (and eventually, I hope, for .. handling).
    Now *all* calls of handle_mounts() and step_into() are right next to
    each other.
16/69 fold handle_mounts() into step_into()
    ... and we can move the call of handle_mounts() into step_into(),
    getting slightly saner calling conventions out of that.
17/69 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
    another payoff from 14/17 - we can teach path_lookupat() to do
    what path_mountpointat() used to.  And kill the latter, along with
    its wrappers.
18/69 expand the only remaining call of path_lookup_conditional()
    minor cleanup - RIP path_lookup_conditional().  Only one caller left.

Changes so far:
    * mount traversal is taken into step_into().
    * lookup_fast(), atomic_open() and lookup_open() calling conventions
      are slightly changed.  All of them return dentry now, instead of
      returning an int and filling struct path on success.  For
      lookup_fast() the old "0 for cache miss, 1 for cache hit" is
      replaced with "NULL stands for cache miss, dentry - for hit".
    * step_into() can be called in RCU mode as well.  Takes nameidata,
      WALK_... flags, dentry and, in RCU case, corresponding inode and
      seq value.  Handles mount traversals, decides whether it's a
      symlink to be followed.  Error => returns -E...; symlink to
      follow => returns 1, puts symlink on stack; non-symlink or symlink
      not to follow => returns 0, moves nd->path to new location.
    * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
      became calls of user_path_at() et al. with LOOKUP_MOUNTPOINT in
      flags.

part 3: untangling the symlink handling.

Right now when we decide to follow a symlink it happens this way:
    * step_into() decides that it has been given a symlink that needs to
      be followed.
    * it calls pick_link(), which pushes the symlink on stack and
      returns 1 on success / -E... on error.  Symlink's mount/dentry/seq
      is stored on stack and the inode is stashed in nd->link_inode.
    * step_into() passes that 1 to its callers, which proceed to pass it
      up the call chain for several layers.  In all cases we get to
      get_link() call shortly afterwards.
    * get_link() is called, picks the inode stashed in nd->link_inode by
      the pick_link(), does some checks, touches the atime, etc.
    * get_link() either picks the link body out of inode or calls
      ->get_link().  If it's an absolute symlink, we move to the root and
      return the relative portion of the body; if it's a relative one -
      just return the body.  If it's a procfs-style one, the call of
      nd_jump_link() has been made and we'd moved to whatever location is
      desired.  And return NULL, same as we do for symlink to "/".
    * the caller proceeds to deal with the string returned to it.

The sequence is the same in all cases (nested symlink, trailing symlink
on lookup, trailing symlink on open), but its pieces are not close to
each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow.  Moreover, a bunch
of functions (walk_component/lookup_last/do_last) end up with the same
conventions for return values as step_into().  And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes,
especially when returned 1 is used only to direct control flow on a
rather twisted way to the matching get_link() call.  And that path can
be seriously twisted.  E.g.
when we are trying to open /dev/stdin, we get the following sequence:
    * path_init() has put us into root and returned "/dev/stdin"
    * link_path_walk() has eventually reached /dev and left
      <LAST_NORM, "stdin"> in nd->last_type/nd->last
    * we call do_last(), which sees that we have LAST_NORM and calls
      lookup_fast().  Let's assume that everything is in dcache; we get
      the dentry of /dev/stdin and proceed to finish_lookup:, where we
      call step_into()
    * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
      damn thing.  Into the stack it goes and we return 1.
    * do_last() sees 1 and returns it.
    * trailing_symlink() is called (in the top-level loop) and it calls
      get_link().  OK, we get "/proc/self/fd/0" for body, move to root
      again and return "proc/self/fd/0".
    * link_path_walk() is given that string, eventually leading us into
      /proc/self/fd, with <LAST_NORM, "0"> left as the component to
      handle.
    * do_last() is called, and similar to the previous case we
      eventually reach the call of step_into() with dentry of
      /proc/self/fd/0.
    * _now_ we can discard /dev/stdin from the stack (we'd been using
      its body until now).  It's dropped (from step_into()) and we get
      to look at what we'd been given.  A symlink to follow, so on the
      stack it goes and we return 1.
    * again, do_last() passes 1 to caller
    * trailing_symlink() is called and calls get_link().
    * this time it's a procfs symlink and its ->get_link() method moves
      us to the mount/dentry of our stdin.  And returns NULL.  But the
      fun doesn't stop yet.
    * trailing_symlink() returns "" to the caller
    * link_path_walk() is called on that and does nothing whatsoever.
    * do_last() is called and sees LAST_BIND left by the get_link().  It
      calls handle_dots()
    * handle_dots() drops the symlink from stack and returns
    * do_last() *FINALLY* proceeds to the point after its call of
      step_into() (finish_open:) and gets around to opening the damn
      thing.

Making sense of the control flow through all of that is not fun, to put
it mildly; debugging anything in that area can be a massive PITA, and
this example has touched only one of 3 cases.  Arguably, the worst one,
but...

Anyway, it turns out that this code can be massaged into a considerably
saner shape - both in terms of control flow and wrt calling conventions.

19/69 merging pick_link() with get_link(), part 1
    prep work: move the "hardening" crap from trailing_symlink() into
    get_link() (conditional on the absence of LOOKUP_PARENT in
    nd->flags).  We'll be moving the calls of get_link() around quite a
    bit through that series, and the next step will be to eliminate
    trailing_symlink().
20/69 merging pick_link() with get_link(), part 2
    fold trailing_symlink() into lookup_last() and do_last().  Now these
    are returning strings; it's not the final calling conventions, but
    it's almost there.
        NULL => old 0, we are done.
        ERR_PTR(-E...) => old -E..., we'd failed.
        string => old 1, and the string is the symlink body to follow.
    Just as for trailing_symlink(), "/" and procfs ones (where
    get_link() returns NULL) yield "", so the ugly song and dance with
    no-op trip through link_path_walk()/handle_dots() still remains.
21/69 merging pick_link() with get_link(), part 3
    elimination of that round-trip.  In *all* cases having get_link()
    return NULL on such symlinks means that we'll proceed to drop the
    symlink from stack and get back to the point near that get_link()
    call - basically, where we would be if it hadn't been a symlink at
    all.  The path by which we are getting there depends upon the call
    site; the end result is the same in all cases - such symlinks
    (procfs ones and symlink to "/") are fully processed by the time
    get_link() returns, so we could as well drop them from the stack
    right in get_link().  Makes life simpler in terms of control flow
    analysis...
    And now the calling conventions for do_last() and lookup_last() have
    reached the final shape - ERR_PTR(-E...)
    for error, NULL for "we are done", string for "traverse this".
22/69 merging pick_link() with get_link(), part 4
    now all calls of walk_component() are followed by the same
    boilerplate - "if it has returned 1, call get_link() and if that has
    returned NULL treat that as if walk_component() has returned 0".
    Eliminate by folding that into walk_component() itself.  Now
    walk_component() return value conventions have joined those of
    do_last()/lookup_last().
23/69 merging pick_link() with get_link(), part 5
    same as for the previous, only this time the boilerplate migrates
    one level down, into step_into().  Only one caller of get_link()
    left, step_into() has joined the same return value conventions.
24/69 merging pick_link() with get_link(), part 6
    move that thing into pick_link().  Now all traces of "return 1 if we
    are following a symlink" are gone.
25/69 finally fold get_link() into pick_link()
    ta-da - expand get_link() into the only caller.  As a side benefit,
    we get rid of stashing the inode in nd->link_inode - it was done
    only to carry that piece of information from pick_link() to eventual
    get_link().  That's not the main benefit, though - the control flow
    became considerably easier to reason about.

For what it's worth, the example above (/dev/stdin) becomes
    * path_init() has put us into root and returned "/dev/stdin"
    * link_path_walk() has eventually reached /dev and left
      <LAST_NORM, "stdin"> in nd->last_type/nd->last
    * we call do_last(), which sees that we have LAST_NORM and calls
      lookup_fast().  Let's assume that everything is in dcache; we get
      the dentry of /dev/stdin and proceed to finish_lookup:, where we
      call step_into()
    * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
      damn thing.  On the stack it goes and we get its body.  Which is
      "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
    * do_last() sees non-NULL and returns it - whether it's an error or
      a pathname to traverse, we hadn't reached something we'll be
      opening.
    * link_path_walk() is given that string, eventually leading us into
      /proc/self/fd, with <LAST_NORM, "0"> left as the component to
      handle.
    * do_last() is called, and similar to the previous case we
      eventually reach the call of step_into() with dentry of
      /proc/self/fd/0.
    * _now_ we can discard /dev/stdin from the stack (we'd been using
      its body until now).  It's dropped (from step_into()) and we get
      to look at what we'd been given.  A symlink to follow, so on the
      stack it goes.  This time it's a procfs symlink and its
      ->get_link() method moves us to the mount/dentry of our stdin.
      And returns NULL.  So we drop symlink from stack and return that
      NULL to caller.
    * that NULL is returned by step_into(), same as if we had just moved
      to a non-symlink.
    * do_last() proceeds to open the damn thing.

Some low-hanging fruit becomes available: LAST_BIND can be removed and
the predicate controlling may_follow_link() we'd moved into pick_link()
in #18 can be made more straightforward:

26/69 LAST_BIND removal
    The only reason to keep it had just been eliminated - it was needed
    to route the control flow through that weird last iteration through
    the loop.  With that iteration gone...
27/69 invert the meaning of WALK_FOLLOW
28/69 pick_link(): check for WALK_TRAILING, not LOOKUP_PARENT
    In #18 the checks specific to trailing symlinks got moved into
    pick_link(), where they were made conditional upon LOOKUP_PARENT in
    nd->flags.  That works, but it's more subtle than I would like it to
    be - it depends upon the dynamic state (nd->flags) which gets
    changed through the pathwalk and it's sensitive to exact locations
    where we flip LOOKUP_PARENT.  Now we have a more robust way to do
    that - the call chains that end up in pick_link() with LOOKUP_PARENT
    in nd->flags are those that had WALK_TRAILING passed to the
    immediate caller of pick_link() (step_into()).  So we can pass
    WALK_...
    down to pick_link() and turn the check into explicit "if we are
    passed WALK_TRAILING, it's a trailing symlink and we need to apply
    the checks in may_follow_link()".
    We could, in principle, reorder these two commits into the very
    beginning of symlink series; that would make #18 slightly simpler at
    the cost of (marginally) more boilerplate to carry through the
    get_link() call moves.  Not sure if it's worth doing, though...
29/69 link_path_walk(): simplify stack handling
    Another cleanup that becomes possible is handling of the stack(s).
    We use nd->stack to store two things: pinning down the symlinks we
    are resolving and resuming the name traversal when a nested symlink
    is finished.
    Currently, nd->depth is used to keep track of both.  It's 0 when we
    call link_path_walk() for the first time (for the pathname itself)
    and 1 on all subsequent calls (for trailing symlinks, if any).
    That's fine, as far as pinning symlinks goes - when handling a
    trailing symlink, the string we are interpreting is the body of
    symlink pinned down in nd->stack[0].  It's rather inconvenient with
    respect to handling nested symlinks, though - when we run out of a
    string we are currently interpreting, we need to decide whether it's
    a nested symlink (in which case we need to pick the string saved
    back when we started to interpret that nested symlink and resume its
    traversal) or not (in which case we are done with link_path_walk()).
    Current solution is a bit of a kludge - in handling of trailing
    symlink (in lookup_last() and open_last_lookups()) we clear
    nd->stack[0].name.  That allows link_path_walk() to use the
    following rules when running out of a string to interpret:
        * if nd->depth is zero, we are at the end of pathname itself.
        * if nd->depth is positive, check the saved string; for nested
          symlink it will be non-NULL, for trailing symlink - NULL.
    It works, but it's rather non-obvious.  Note that we have two sets:
    the set of symlinks currently being traversed and the set of
    postponed pathname tails.
    The former is stored in nd->stack[0..nd->depth-1].link and it's
    valid throughout the pathname resolution; the latter is valid only
    during an individual call of link_path_walk() and it occupies
    nd->stack[0..nd->depth-1].name for the first call of
    link_path_walk() and nd->stack[1..nd->depth-1].name for subsequent
    ones.  The kludge is basically a way to recognize the second set
    becoming empty.
    Things get simpler if we keep track of the second set's size
    explicitly and always store it in nd->stack[0..depth-1].name.  We
    access the second set only inside link_path_walk(), so its size can
    live in a local variable; that way the check becomes trivial without
    the need of that kludge.
30/69 namei: have link_path_walk() maintain LOOKUP_PARENT
    just set it on the entry into link_path_walk() and clear when we get
    to the last component.  Removes boilerplate from lookup_last() and
    do_last().

part 4: some mount traversal cleanups.

31/69 massage __follow_mount_rcu() a bit
    make it more similar to non-RCU counterpart
32/69 new helper: traverse_mounts()
    the guts of follow_managed() are very similar to follow_down().  The
    calling conventions are different (follow_managed() works with
    nameidata, follow_down() - with standalone struct path), but the
    core loop is pretty much the same in both.  Turned that loop into a
    common helper (traverse_mounts()) and since follow_managed() becomes
    a very thin wrapper around it, expand follow_managed() at its only
    call site (in handle_mounts()).

part 5: do_last() untangling.

Control flow in do_last() is an atrocity, and liveness analysis in there
is rather painful.  What follows is a massage of that thing into
(hopefully) more straightforward shape; by the end of the series it's
still unpleasant, but at least easier to follow.

A major source of headache is treatment of "we'd already managed to
open it in ->atomic_open()" and "we'd just created that sucker" cases -
that's what gives complicated control flow graph.
As it is, we have the following horror:

           #
    /------*                        ends with . or ..?
    #      |
    |      #
    |  /---*                        found in dcache, no O_CREAT?
    |  |   |
    |  |   #                        call lookup_open() here.
    |  |   |
    |  |   *---------------\        already opened in ->atomic_open()?
    |  |   |               #
    |  |   *---\           |        freshly created file?
    |  |   |   #           |
    |  \---+   |           |        finish_lookup:
    |      #   |           |
    |      *---------------------> is it a symlink?
    \------+   |           |        finish_open:
           #   |           |
           +--/            |        finish_open_created:
           #               |
           +---------------/        opened:
           #

To make it even more unpleasant, there is quite a bit of similar, but
not entirely identical logic on parallel branches, some of it buried in
lookup_open() *and* atomic_open() called by it.  Keeping track of that
has been hard and that had led to more than one bug.

33/69 atomic_open(): return the right dentry in FMODE_OPENED case
    As it is, several invariants do not hold in "we'd already opened it
    in ->atomic_open()" case.  In particular, nd->path.dentry might be
    pointing to the wrong place by the time we return to do_last() - on
    that codepath we don't care anymore.  That both makes it harder to
    reason about and serves as an obstacle to transformations that would
    untangle that mess.  Fortunately, it's not hard to regularize.
34/69 atomic_open(): lift the call of may_open() into do_last()
    may_open() is called before vfs_open() in "hadn't opened in
    ->atomic_open()" case.  Rightfully so, since vfs_open() for e.g.
    devices can have side effects.  In "opened in ->atomic_open()" case
    we have to do it after the actual opening - the whole point is to
    combine open with lookup and we only get the information needed for
    may_open() after the combined lookup/open has happened.  That's OK -
    no side effects are possible in that case.  However, we don't have
    to keep that call of may_open() inside fs/namei.c:atomic_open(); as
    a matter of fact, lifting it into do_last() allows us to simplify
    life there...
35/69 do_last(): merge the may_open() calls
    ... since now we have the "it's already opened" case in do_last()
    rejoin the main path at earlier point.
At that point the horror graph from above has become

           #
    /------*                        ends with . or ..?
    #      |
    |      #
    |  /---*                        found in dcache, no O_CREAT?
    |  |   |
    |  |   #                        call lookup_open() here.
    |  |   |
    |  |   *---------------\        already opened in ->atomic_open()?
    |  |   |               #
    |  |   *---\           |        freshly created file?
    |  |   |   #           |
    |  \---+   |           |        finish_lookup:
    |      #   |           |
    |      *---------------------> is it a symlink?
    \------+   |           |        finish_open:
           #   |           |
           +--/------------/        finish_open_created:
           #

36/69 do_last(): don't bother with keeping got_write in FMODE_OPENED case
    Another source of unpleasantness is an attempt to be clever and keep
    track of write access status; the thing is, it doesn't really buy us
    anything - we could as well drop it right after the lookup_open()
    and only regain it for truncation, should such be needed.  Makes for
    much simpler cleanups on failures and sets the things up for
    unification of "already opened" and "new file" branches with the
    main path...
37/69 do_last(): rejoin the common path earlier in FMODE_{OPENED,CREATED} case
    ... which we do here.
38/69 do_last(): simplify the liveness analysis past finish_open_created
    It also makes it possible to shrink the liveness intervals for local
    variables.
39/69 do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case
    Further unification of parallel branches.

At that point we get

           #
    /------*                        ends with . or ..?
    #      |
    |      #
    |  /---*                        found in dcache, no O_CREAT?
    |  |   |
    |  |   #                        call lookup_open() here.
    |  |   |
    |  |   *---\                    opened by ->atomic_open() or freshly created?
    |  \---+   |                    finish_lookup:
    |      #   |
    |      *---------------------> is it a symlink?
    \------+   |                    finish_open:
           |   |
           +--/                     finish_open_created:
           #

with very little work done between finish_open: and
finish_open_created:, as well as on any of the side branches.  Moreover,
we have a pretty clear separation: most of the work on _opening_ is
after finish_open_created (some of it - conditional), while the work on
lookups and creation is all before that point.
Even better, most of the local variables are used either only before or
only after that cutoff point.

40/69 split the lookup-related parts of do_last() into a separate helper
    ... which allows us to separate the lookup-related parts from
    open-related ones.

I'm not saying I'm entirely happy with the resulting state of do_last()
clusterfuck, but it got a lot easier to follow and reason about.  There
are more cleanups possible (and needed) in there, though - there will be
followups.

part 6: ".." handling

The main problem with .. traversal is that it has open-coded mount
crossing in the end.  That mount crossing used to be identical to that
done after the normal name components, but it got overlooked in several
series (most recently - openat2, prior to that - mount traps and
automounts) and now we have an out-of-sync variant (two of them,
actually - RCU and non-RCU cases) festering there and breeding
hard-to-spot bugs.  The most recent example was when openat2 got extra
checks added to the normal mount crossing; it added the same to RCU case
of .. handling, but missed the non-RCU one.  Nobody noticed during the
many rounds of review - me, Christoph and Linus included.

Another issue is that we are heavier on the locking than we need to be
during the rootwards mount traversal there; traversing mounts in the
other direction (from mountpoint to mounted) gets by without grabbing
mount_lock exclusive (or dirtying its cacheline in any way, for that
matter), even in non-RCU case.  The same should be the case for
mounted-to-mountpoint transitions - except for the case of very rare
races with mount --move and friends, we should be fine with just the
seqcount checks there.

The following is how ..
handling behaves on just about any post-v7 Unix:

    while true
        if the caller is chrooted into directory        // rare (A)
            parent = directory
            break
        if directory is absolute root                   // rare (B)
            parent = directory
            break
        if directory is mounted on top of mountpoint    // 2nd most common (C)
            directory = mountpoint
        else                                            // the most common (D)
            parent = the parent of directory (within its fs)
            break
    while something is mounted on parent                // unusual setup (E)
        parent = whatever overmounts it
    return parent

There are 3 paths that execution commonly takes, and they cover almost
everything that occurs in practice:
    1) [A] We are in / and we stay there
    2) [C,D] We are in root of mounted filesystem, we step into the
       underlying mountpoint, then into the parent of mountpoint.
    3) [D] The place we are in is not a root of mounted filesystem, we
       step into its parent.
These cases are obvious.  However, other execution paths are possible;
in fact, the only constraint is that if we leave the first loop via (A)
or (B), the body of the second loop (i.e. going from mountpoint to
mounted) will be executed at least as many times as (C) (going from
mounted to mountpoint) had been.

A closer look at the predicates in the above shows that "is absolute
root" is actually "is root of a mount and that mount is not attached to
anything" while "is mounted on top of mountpoint" is "is a root of a
mount and the mount is attached to mountpoint".
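
For illustration, the loop above can be written down in C over a toy
mount tree.  This is only a sketch under loudly stated assumptions, not
the fs/namei.c code: toy_dentry, toy_mount, toy_path and toy_dotdot()
are hypothetical names, each toy dentry belongs to exactly one mount,
at most one mount sits on top of a dentry, and there is no locking or
RCU of any kind.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mount;

/* Hypothetical toy dentry; the root of a filesystem is its own parent. */
struct toy_dentry {
	struct toy_dentry *parent;	/* parent within the same fs */
	struct toy_mount *mounted;	/* mount sitting on top of us, or NULL */
};

/* Hypothetical toy mount. */
struct toy_mount {
	struct toy_mount *parent;	/* NULL if not attached to anything */
	struct toy_dentry *mountpoint;	/* where in parent we are attached */
	struct toy_dentry *root;	/* root dentry of this mount */
};

struct toy_path {
	struct toy_mount *mnt;
	struct toy_dentry *dentry;
};

/* Where does ".." lead from dir for a caller chrooted into root? */
static struct toy_path toy_dotdot(struct toy_path dir, struct toy_path root)
{
	struct toy_path parent;

	assert(dir.mnt && dir.dentry);
	for (;;) {
		if (dir.mnt == root.mnt && dir.dentry == root.dentry) {
			parent = dir;			/* (A) chroot boundary */
			break;
		}
		if (dir.dentry == dir.mnt->root) {
			/* root of a mount: either (B) or (C) */
			if (!dir.mnt->parent) {
				parent = dir;		/* (B) absolute root */
				break;
			}
			/* (C) step down to the mountpoint and retry */
			dir.dentry = dir.mnt->mountpoint;
			dir.mnt = dir.mnt->parent;
		} else {
			/* (D) plain parent within the same fs */
			parent.mnt = dir.mnt;
			parent.dentry = dir.dentry->parent;
			break;
		}
	}
	/* (E) cross into whatever is mounted on top of the result */
	while (parent.dentry->mounted) {
		parent.mnt = parent.dentry->mounted;
		parent.dentry = parent.mnt->root;
	}
	return parent;
}
```

With a root filesystem and a second one mounted on /a, ".." from the
root of the second filesystem takes path [C,D] and ends in / of the
first, while ".." from a directory hidden under /a takes [D] followed
by (E) and ends in the root of the overmounting filesystem.  In this
sketch (B) and (C) sit behind one and the same "is a root of a mount"
test, which is the observation made about the predicates above.
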
Which suggests the following transformation:

    choose_mountpoint(mount, &ancestor)
        while mount is attached to something
            d = mountpoint(mount)
            mount = parent(mount)
            if the caller is chrooted into <mount, d>
                break
            if d is not a root of mount
                ancestor = <mount, d>
                return true
        return false

    handle_dotdot(directory)
        if unlikely(the caller is chrooted into directory)
            goto in_root
        if unlikely(directory is a root of some mount)
            if !choose_mountpoint(mount, &ancestor)
                goto in_root
            directory = ancestor
        parent = the parent of directory (within its fs)
        while unlikely(something is mounted on parent)
            parent = whatever overmounts it
        return parent
    in_root:
        parent = directory
        while unlikely(something is mounted on parent)
            parent = whatever overmounts it
        return parent

In this form we have mounted-to-mountpoint mount traversals clearly
separated.  Moreover, required updates of pathwalk context (nameidata)
can be packed into a call of the same primitive (step_into()) we use for
moves into normal components, including the forward mount traversals.
NO_XDEV and BENEATH checks (added by openat2 series) fit into that just
fine - NO_XDEV at "directory = ancestor" part, BENEATH - at in_root.
Since the forward mount traversal is done by step_into(), the regular
NO_XDEV checks in there take care of the rest.

The following series massages follow_dotdot/follow_dotdot_rcu() to that
form and does choose_mountpoint() implementation with saner locking
than what we do in mainline now - for RCU case we only need to check
mount_lock seqcount once (in the caller), for non-RCU we can use a loop
similar to what lookup_mnt() does for forward traversals.

41/69 path_connected(): pass mount and dentry separately
42/69 path_parent_directory(): leave changing path->dentry to callers
43/69 follow_dotdot(): expand the call of path_parent_directory()
    currently switching to parent is done inside
    path_parent_directory(), called from the loop in follow_dotdot().
    These 3 commits lift the switch to parent into the loop in
    follow_dotdot() itself...
44/69 follow_dotdot{,_rcu}(): lift switching nd->path to parent out of loop
45/69 follow_dotdot{,_rcu}(): lift LOOKUP_BENEATH checks out of loop
    ... and out of the loop(s) (both on the RCU and non-RCU sides)

Next part is to replace the second halves (crossing into parent and
whatever might be overmounting it) with step_into().

46/69 move handle_dots(), follow_dotdot() and follow_dotdot_rcu() past step_into()
    pure move - get them into the right place
47/69 handle_dots(), follow_dotdot{,_rcu}(): preparation to switch to step_into()
    convert to returning ERR_PTR()/NULL instead of -E.../0 - that's what
    step_into() returns and the callers are actually happier that way.
48/69 follow_dotdot{,_rcu}(): switch to use of step_into()
    ... and switch both to it.  Now the RCU and non-RCU variants of the
    loop that used to do forward mount traversal on .. are replaced with
    step_into() calls...
49/69 lift all calls of step_into() out of follow_dotdot/follow_dotdot_rcu
    ... which can be consolidated.

We are done with the forward traversal parts.

50/69 follow_dotdot{,_rcu}(): massage loops
51/69 follow_dotdot_rcu(): be lazy about changing nd->path
52/69 follow_dotdot(): be lazy about changing nd->path
    get the rootwards traversal into shape described above
53/69 helper for mount rootwards traversal
54/69 non-RCU analogue of the previous commit
    ... and introduce choose_mountpoint{,_rcu}(), switching both RCU and
    non-RCU variants to it.
55/69 fs/namei.c: kill follow_mount()
    detritus removal - the only remaining caller (path_pts()) ought to
    use follow_down() anyway.

That's about where the previous version of patchset used to end.

part 7: pick_link() and friends.

56/69 pick_link(): more straightforward handling of allocation failures
    There's a rather annoying wart in pick_link() handling of stack
    allocation failures.  In RCU mode we try to do GFP_ATOMIC
    allocation; if that fails, we need to unlazy and retry with
    GFP_KERNEL.
    The problem is that we need to unlazy both the stuff in nameidata *and* the link we are about to push onto stack. We need to do the link first (after successful unlazy we'll have rcu_read_lock() already dropped, so it would be too late). The question is what to do if trying to legitimize link fails. We might need to drop references, so that can't happen until we drop rcu_read_lock(). OTOH, we have no place to stash it until the time it's normally done on error. Result was microoptimized and confusing as hell - explaining the reasons why we had to do anything special, let alone why this and not something else, was highly non-obvious. Turns out that there's a fairly simple (and easily explained) solution, avoiding that mess.
57/69 pick_link(): pass it struct path already with normal refcounting rules
    Some of the struct path instances share the mount reference with nd->path. That's done to avoid grabbing/dropping mount references all the time; unfortunately, it had been the source of quite a few bugs, with those beasts confused for the normal ones (and vice versa) in some failure exit. A lot of pathwalk-related code used to be exposed to those, mostly due to unfortunate calling conventions. After the patches earlier in the series that exposure is much more limited - step_into() and handle_mounts() (where they are really needed - they'd been introduced for a good reason) are pretty much all that is left. pick_link() is slightly exposed - it gets link in such form, but immediately converts to the regular refcounting rules. Doing that conversion in the caller (step_into()) makes for simpler logic.
58/69 fold path_to_nameidata() into its only remaining caller
    Another bit of exposure gone - path_to_nameidata() used to be a primitive for working with those beasts and now it is called only in step_into(). Expanding it there makes things easier to read, actually.
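
To make the "shared mount reference" point in 57/69 a bit more concrete, here is a toy userspace model of the conversion to normal refcounting rules. This is a sketch only: toy_mount, toy_path, toy_make_owned() and toy_path_put() are names made up for illustration, refcounts are plain ints with no locking, and none of this is the actual fs/namei.c code. The assumption modelled is that a path sharing ->mnt with the walk context is merely borrowing the context's reference, while a path on any other mount already holds one of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel structures. */
struct toy_mount {
    int refs;                   /* references currently held on this mount */
};

struct toy_path {
    struct toy_mount *mnt;      /* dentry omitted; only the mount matters here */
};

static void toy_mntget(struct toy_mount *m)
{
    m->refs++;
}

static void toy_mntput(struct toy_mount *m)
{
    assert(m->refs > 0);        /* dropping a reference nobody holds is a bug */
    m->refs--;
}

/*
 * Convert a possibly-borrowed path to normal rules.  If it shares the
 * mount with the walk context, it has been riding on the context's
 * reference, so grab one of its own.  Otherwise (assumption of this
 * model) it already owns its reference and nothing needs to be done.
 */
void toy_make_owned(struct toy_path *link, const struct toy_path *ctx)
{
    if (link->mnt == ctx->mnt)
        toy_mntget(link->mnt);
}

/* Under normal rules every exit, success or failure, drops the same way. */
void toy_path_put(struct toy_path *p)
{
    toy_mntput(p->mnt);
    p->mnt = NULL;
}
```

Once the caller has done the toy_make_owned() step, whatever receives the path can use the unconditional toy_path_put() on all of its failure exits; that is the kind of simplification the commit is after.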
59/69 pick_link(): take reserving space on stack into a new helper
60/69 reserve_stack(): switch to __nd_alloc_stack()
61/69 __nd_alloc_stack(): make it return bool
    Take the stack allocation out of pick_link() and clean it up. That's not the end of it for pick_link() - there are other pieces that would be better off in separate helpers; that's for later, though.

part 8: more untangling of do_last()

While we'd already separated the lookup-related parts of do_last() into a separate function (close analogue of lookup_last()), the top-level loop is still calling do_last(), which starts with calling open_last_lookups(). If that returns non-NULL, we leave do_last() immediately and go through the next iteration of the top-level loop. Otherwise we proceed to do the work on actual opening and whether it succeeds or not, we are done with pathwalk at that point - we know it's not a trailing symlink for us to follow.

More natural way to express that is to have open_last_lookups() called by the loop, with the rest of do_last() done after that loop has ended. Takes a bit of prep work, though:

62/69 link_path_walk(): sample parent's i_uid and i_mode for the last component
    I tried to get that done by may_create_in_sticky(); however, we really need the parent we'd walked through, _NOT_ whatever the file is in at the time of check. The whole thing is about attacker guessing the name of something like a temp file that an incautious victim is about to open in e.g. /tmp with bare O_CREAT (no O_EXCL). Attacker creates a file of his own there, lets the victim open it and start writing to the attacker-owned "temporary" file. Then attacker modifies the content and gets the victim screwed. The check is predicated upon the directory being a sticky one; fair enough, but the same property that makes the attack possible allows the attacker to mkdir a non-sticky subdirectory there and move the file to it.
    If they get that done after the victim has looked the damn thing up (still in the old place), but before it does may_create_in_sticky(), the parent of file at the moment when may_create_in_sticky() gets called is *NOT* a sticky directory at all. In other words, we really must sample mode/uid before doing the lookup.
63/69 take post-lookup part of do_last() out of loop
    That makes the loop much more similar to the one in path_lookupat(). And unlike the full do_last(), open_last_lookups() is fairly similar to lookup_last()/walk_component().

Next come some cleanups of open_last_lookups():

64/69 open_last_lookups(): consolidate fsnotify_create() calls
    straightforward; the only thing to keep in mind is that we need it done before unlocking the parent.
65/69 open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
    ... especially with big comment to the effect that this is all the call is going to do.
66/69 open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
    we move O_EXCL|O_CREAT check from the end of open_last_lookups() (where we'd already decided we are not looking at symlink to follow) out into do_open() (== what's left of do_last()). That's one of the cases when analysis is much longer than the patch itself; see commit message for details.
67/69 open_last_lookups(): move complete_walk() into do_open()
    similar move for complete_walk()
68/69 atomic_open(): no need to pass struct open_flags anymore
    unused argument I'd missed back in 2016; should've removed it back then.
69/69 lookup_open(): don't bother with fallbacks to lookup+create
    That's one of the payoffs we get from #66/69 - the reasons why "we might not have permissions on parent/write access to the entire fs" logic had been so complicated (sometimes we just trim O_CREAT and call ->atomic_open(), sometimes we forcibly fall back to lookup+create path) no longer apply.
    O_TRUNC side effects haven't been a problem for quite a while now (we call handle_truncate() in the end anyway, so trimming O_TRUNC for ->atomic_open() would be fine). O_CREAT|O_EXCL, OTOH, _did_ require the fallback - trimming O_CREAT would not suffice, since that would've disabled the checks in ->atomic_open() with nothing downstream to catch it and fail with EEXIST. Well, now we do have something downstream - the combination of FMODE_CREATED not set while O_CREAT|O_EXCL is present will trigger just that.

In addition to the simpler logic in lookup_open(), the last commit opens the way to something more interesting, but that will have to wait for the next cycle - I do have some stuff in that direction, but it changes the ->atomic_open() calling conventions and it's too late in this cycle for that. I'll post a separate RFC later this week.

Basically, switching ->atomic_open() to returning dentry with the same interpretation of result as for ->lookup() would make it possible to get rid both of the weird DENTRY_NOT_SET thing and of the irregularity in refcounting we have when struct file is used as a vehicle for passing the lookup result to the caller. No extra arguments are needed, and
    return d_in_lookup(dentry) ? foo_lookup(dir, dentry, flags) : NULL;
becomes a legitimate instance, open-coded in lookup_open() side by side with ->atomic_open() call, with codepaths converging immediately after that.

Anyway, that's a separate story and I'm still not entirely sure about some of the details for calling conventions. Definitely not in this series...
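
Coming back to the "something downstream" in 69/69: the check in question boils down to one predicate on the open flags and the file mode. Below is a minimal userspace sketch of it, not the kernel code itself. TOY_FMODE_CREATED is a made-up stand-in for the kernel-internal FMODE_CREATED bit (which is not visible to userspace), and toy_excl_check() and toy_excl_selftest() are hypothetical names; O_CREAT, O_EXCL and EEXIST are the usual ones from <fcntl.h> and <errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Made-up stand-in for the kernel-internal FMODE_CREATED bit. */
#define TOY_FMODE_CREATED 0x1u

/*
 * Returns 0 if the open may proceed, -EEXIST if the caller asked for
 * exclusive creation but the file had not been created by this open.
 */
int toy_excl_check(int open_flag, unsigned int f_mode)
{
    if ((open_flag & O_CREAT) && (open_flag & O_EXCL) &&
        !(f_mode & TOY_FMODE_CREATED))
        return -EEXIST;
    return 0;
}

/* Usage: the cases that matter once the fallback is gone. */
void toy_excl_selftest(void)
{
    /* file already existed, O_CREAT|O_EXCL requested: caught here */
    assert(toy_excl_check(O_CREAT | O_EXCL, 0) == -EEXIST);
    /* this open really did create the file: fine */
    assert(toy_excl_check(O_CREAT | O_EXCL, TOY_FMODE_CREATED) == 0);
    /* bare O_CREAT on an existing file: fine */
    assert(toy_excl_check(O_CREAT, 0) == 0);
}
```

With a check of this shape sitting after the lookup, it does not matter whether the "created" bit was never set because O_CREAT got trimmed before ->atomic_open() or because the file was simply there already; either way the exclusive open fails with EEXIST.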