OK, vfs.git #work.namei seems to survive xfstests. I think it cleans the things quite a bit, but it obviously needs more review and testing. Review and testing would be _very_ welcome; it does a lot of massage, so there had been a plenty of opportunities to fuck up and fail to spot that. The same goes for profiling - it doesn't seem to slow the things down, but that needs to be verified. It does include #work.openat2. Topology: 17 commits, followed by clean merge with #work.openat2, followed by 9 followups. The part is #work.openat2 is as posted by Aleksa; I can repost it, but I don't see much point. Description of the rest follows; patches themselves will be in followups. part 1: follow_automount() cleanups and fixes. Quite a bit of that function had been about working around the wrong calling conventions of finish_automount(). The problem is that finish_automount() misuses the primitive intended for mount(2) and friends, where we want to mount on top of the pile, even if something has managed to add to that while we'd been trying to lock the namespace. For automount that's not the right thing to do - there we want to discard whatever it was going to attach and just cross into what got mounted there in the meanwhile (most likely - the results of the same automount triggered by somebody else). Current mainline kinda-sorta manages to do that, but it's unreliable and very convoluted. Much simpler approach is to stop using lock_mount() in finish_automount() and have it bail out if something turns out to have been mounted on top where we wanted to attach. That allows to get rid of a lot of PITA in the caller. Another simplification comes from not trying to cross into the results of automount - simply ride through the next iteration of the loop and let it move into overmount. Another thing in the same series is divorcing follow_automount() from nameidata; that'll play later when we get to unifying follow_down() with the guts of follow_managed(). 4 commits, the second one fixes a hard-to-hit race. The first is a prereq for it. 1/17 do_add_mount(): lift lock_mount/unlock_mount into callers 2/17 fix automount/automount race properly 3/17 follow_automount(): get rid of dead^Wstillborn code 4/17 follow_automount() doesn't need the entire nameidata part 2: unifying mount traversals in pathwalk. Handling of mount traversal (follow_managed()) is currently called in a bunch of places. Each of them is shortly followed by a call of step_into() or an open-coded equivalent thereof. However, the locations of those step_into() calls are far from preceding follow_managed(); moreover, that preceding call might happen on different paths that converge to given step_into() call. It's harder to analyse that it should be (especially when it comes to liveness analysis) and it forces rather ugly calling conventions on lookup_fast()/atomic_open()/lookup_open(). The series below massages the code to the point when the calls of follow_managed() (and __follow_mount_rcu()) move into the beginning of step_into(). 5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW gets EEXIST handling in do_last() past the step_into() call there. 6/17 handle_mounts(): start building a sane wrapper for follow_managed() rather than mangling follow_managed() itself (and creating conflicts with openat2 series), add a wrapper that will absorb the required interface changes. 7/17 atomic_open(): saner calling conventions (return dentry on success) struct path passed to it is pure out parameter; only dentry part ever varies, though - mnt is always nd->path.mnt. Just return the dentry on success, and ERR_PTR(-E...) on failure. 8/17 lookup_open(): saner calling conventions (return dentry on success) propagate the same change one level up the call chain. 9/17 do_last(): collapse the call of path_to_nameidata() struct path filled in lookup_open() call is eventually given to handle_mounts(); the only use it has before that is path_to_nameidata() call in "->atomic_open() has actually opened it" case, and there path_to_nameidata() is an overkill - we are guaranteed to replace only nd->path.dentry. So have the struct path filled only immediately prior to handle_mounts(). 10/17 handle_mounts(): pass dentry in, turn path into a pure out argument now all callers of handle_mount() are directly preceded by filling struct path it gets. path->mnt is nd->path.mnt in all cases, so we can pass just the dentry instead and fill path in handle_mount() itself. Some boilerplate gone, path is pure out argument of handle_mount() now. 11/17 lookup_fast(): consolidate the RCU success case massage to gather what will become an RCU case equivalent of handle_mounts(); basically, that's what we do if revalidate succeeds in RCU case of lookup_fast(), including unlazy and fallback to handle_mounts() if __follow_mount_rcu() says "it's too tricky". 12/17 teach handle_mounts() to handle RCU mode ... and take that into handle_mount() itself. The other caller of __follow_mount_rcu() is fine with the same fallback (it just didn't bother since it's in the very beginning of pathwalk), switched to handle_mount() as well. 13/17 lookup_fast(): take mount traversal into callers Now we are getting somewhere - both RCU and non-RCU success cases of lookup_fast() are ended with the same return handle_mounts(...); move that to the callers - there it will merge with the identical calls that had been on the paths where we had to do slow lookups. lookup_fast() returns dentry now. 14/17 new step_into() flag: WALK_NOFOLLOW use step_into() instead of open-coding it in handle_lookup_down(). Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for that (and eventually, I hope, for .. handling). Now *all* calls of handle_mounts() and step_into() are right next to each other. 15/17 fold handle_mounts() into step_into() ... and we can move the call of handle_mounts() into step_into(), getting a slightly saner calling conventions out of that. 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() another payoff from 14/17 - we can teach path_lookupat() to do what path_mountpointat() used to. And kill the latter, along with its wrappers. 17/17 expand the only remaining call of path_lookup_conditional() minor cleanup - RIP path_lookup_conditional(). Only one caller left. At that point we run out of things that can be done without textual conflicts with openat2 series. Changes so far: * mount traversal is taken into step_into(). * lookup_fast(), atomic_open() and lookup_open() calling conventions are slightly changed. All of them return dentry now, instead of returning an int and filling struct path on success. For lookup_fast() the old "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache miss, dentry - for hit". * step_into() can be called in RCU mode as well. Takes nameidata, WALK_... flags, dentry and, in RCU case, corresponding inode and seq value. Handles mount traversals, decides whether it's a symlink to be followed. Error => returns -E...; symlink to follow => returns 1, puts symlink on stack; non-symlink or symlink not to follow => returns 0, moves nd->path to new location. * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags. Next comes the merge with Aleksa's openat2 patchset; everything up to that point had been non-conflicting with it. That patchset has been posted earlier; it's in #work.openat2. The next series comes on top of the merge. part 3: untangling the symlink handling. Right now when we decide to follow a symlink it happens this way: * step_into() decides that it has been given a symlink that needs to be followed. * it calls pick_link(), which pushes the symlink on stack and returns 1 on success / -E... on error. Symlink's mount/dentry/seq is stored on stack and the inode is stashed in nd->link_inode. * step_into() passes that 1 to its callers, which proceed to pass it up the call chain for several layers. In all cases we get to get_link() call shortly afterwards. * get_link() is called, picks the inode stashed in nd->link_inode by the pick_link(), does some checks, touches the atime, etc. * get_link() either picks the link body out of inode or calls ->get_link(). If it's an absolute symlink, we move to the root and return the relative portion of the body; if it's a relative one - just return the body. If it's a procfs-style one, the call of nd_jump_link() has been made and we'd moved to whatever location is desired. And return NULL, same as we do for symlink to "/". * the caller proceeds to deal with the string returned to it. The sequence is the same in all cases (nested symlink, trailing symlink on lookup, trailing symlink on open), but its pieces are not close to each other and the bit between the call of pick_link() and (inevitable) call of get_link() afterwards is not easy to follow. Moreover, a bunch of functions (walk_component/lookup_last/do_last) ends up with the same conventions for return values as step_into(). And those conventions (see above) are not pretty - 0/1/-E... is asking for mistakes, especially when returned 1 is used only to direct control flow on a rather twisted way to matching get_link() call. And that path can be seriously twisted. E.g. when we are trying to open /dev/stdin, we get the following sequence: * path_init() has put us into root and returned "/dev/stdin" * link_path_walk() has eventually reached /dev and left <LAST_NORM, "stdin"> in nd->last_type/nd->last * we call do_last(), which sees that we have LAST_NORM and calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into() * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the damn thing. Into the stack it goes and we return 1. * do_last() sees 1 and returns it. * trailing_symlink() is called (in the top-level loop) and it calls get_link(). OK, we get "/proc/self/fd/0" for body, move to root again and return "proc/self/fd/0". * link_path_walk() is given that string, eventually leading us into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle. * do_last() is called, and similar to the previous case we eventually reach the call of step_into() with dentry of /proc/self/fd/0. * _now_ we can discard /dev/stdin from the stack (we'd been using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes and we return 1. * again, do_last() passes 1 to caller * trailing_symlink() is called and calls get_link(). * this time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. But the fun doesn't stop yet. * trailing_symlink() returns "" to the caller * link_path_walk() is called on that and does nothing whatsoever. * do_last() is called and sees LAST_BIND left by the get_link(). It calls handle_dots() * handle_dots() drops the symlink from stack and returns * do_last() *FINALLY* proceeds to the point after its call of step_into() (finish_open:) and gets around to opening the damn thing. Making sense of the control flow through all of that is not fun, to put it mildly; debugging anything in that area can be a massive PITA, and this example has touched only one of 3 cases. Arguably, the worst one, but... Anyway, it turns out that this code can be massaged to considerably saner shape - both in terms of control flow and wrt calling conventions. 1/9 merging pick_link() with get_link(), part 1 prep work: move the "hardening" crap from trailing_symlink() into get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags). We'll be moving the calls of get_link() around quite a bit through that series, and the next step will be to eliminate trailing_symlink(). 2/9 merging pick_link() with get_link(), part 2 fold trailing_symlink() into lookup_last() and do_last(). Now these are returning strings; it's not the final calling conventions, but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) => old -E..., we'd failed. string => old 1, and the string is the symlink body to follow. Just as for trailing_symlink(), "/" and procfs ones (where get_link() returns NULL) yield "", so the ugly song and dance with no-op trip through link_path_walk()/handle_dots() still remains. 3/9 merging pick_link() with get_link(), part 3 elimination of that round-trip. In *all* cases having get_link() return NULL on such symlinks means that we'll proceed to drop the symlink from stack and get back to the point near that get_link() call - basically, where we would be if it hadn't been a symlink at all. The path by which we are getting there depends upon the call site; the end result is the same in all cases - such symlinks (procfs ones and symlink to "/") are fully processed by the time get_link() returns, so we could as well drop them from the stack right in get_link(). Makes life simpler in terms of control flow analysis... And now the calling conventions for do_last() and lookup_last() have reached the final shape - ERR_PTR(-E...) for error, NULL for "we are done", string for "traverse this". 4/9 merging pick_link() with get_link(), part 4 now all calls of walk_component() are followed by the same boilerplate - "if it has returned 1, call get_link() and if that has returned NULL treat that as if walk_component() has returned 0". Eliminate by folding that into walk_component() itself. Now walk_component() return value conventions have joined those of do_last()/lookup_last(). 5/9 merging pick_link() with get_link(), part 5 same as for the previous, only this time the boilerplate migrates one level down, into step_into(). Only one caller of get_link() left, step_into() has joined the same return value conventions. 6/9 merging pick_link() with get_link(), part 6 move that thing into pick_link(). Now all traces of "return 1 if we are following a symlink" are gone. 7/9 finally fold get_link() into pick_link() ta-da - expand get_link() into the only caller. As a side benefit, we get rid of stashing the inode in nd->link_inode - it was done only to carry that piece of information from pick_link() to eventual get_link(). That's not the main benefit, though - the control flow became considerably easier to reason about. For what it's worth, the example above (/dev/stdin) becomes * path_init() has put us into root and returned "/dev/stdin" * link_path_walk() has eventually reached /dev and left <LAST_NORM, "stdin"> in nd->last_type/nd->last * we call do_last(), which sees that we have LAST_NORM and calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into() * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the damn thing. On the stack it goes and we get its body. Which is "/proc/self/fd/0", so we move to root and return "proc/self/fd/0". * do_last() sees non-NULL and returns it - whether it's an error or a pathname to traverse, we hadn't reached something we'll be opening. * link_path_walk() is given that string, eventually leading us into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle. * do_last() is called, and similar to the previous case we eventually reach the call of step_into() with dentry of /proc/self/fd/0. * _now_ we can discard /dev/stdin from the stack (we'd been using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes. This time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. So we drop symlink from stack and return that NULL to caller. * that NULL is returned by step_into(), same as if we had just moved to a non-symlink. * do_last() proceeds to open the damn thing. part 4. some mount traversal cleanups. 8/9 massage __follow_mount_rcu() a bit make it more similar to non-RCU counterpart 9/9 new helper: traverse_mounts() the guts of follow_managed() are very similar to follow_down(). The calling conventions are different (follow_managed() works with nameidata, follow_down() - with standalone struct path), but the core loop is pretty much the same in both. Turned that loop into a common helper (traverse_mounts()) and since follow_managed() becomes a very thin wrapper around it, expand follow_managed() at its only call site (in handle_mounts()), That's where the series stands right now. FWIW, at 5.5-rc1 fs/namei.c had been 4867 lines, at the tip of #work.openat2 - 4998, at the tip of #work.namei (containing #work.openat2) - 4730... And IMO the thing has become considerably easier to follow. What's more, it might be possible to untangle the control flow in do_last() now. Probably a separate series, though - do_last() is one hell of a tarpit, so I'm not stepping into it for the rest of this cycle... _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers