On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote:
> OK, vfs.git #work.namei seems to survive xfstests. I think
> it cleans things up quite a bit, but it obviously needs more
> review and testing.
>
> Review and testing would be _very_ welcome; it does a lot
> of massage, so there had been plenty of opportunities to fuck up
> and fail to spot that. The same goes for profiling - it doesn't
> seem to slow things down, but that needs to be verified.

I have run my usual tests (the second run of my submount-test is still
going) and they have run through fine. I'll spend what time I can
looking through the series tomorrow but will probably need to complete
that when I return from my trip to Albany (Western Australia) some time
on Friday.

> It does include #work.openat2. Topology: 17 commits, followed
> by a clean merge with #work.openat2, followed by 9 followups. The
> #work.openat2 part is as posted by Aleksa; I can repost it, but
> I don't see much point. Description of the rest follows; patches
> themselves will be in followups.
>
> part 1: follow_automount() cleanups and fixes.
>
> Quite a bit of that function had been about working around the
> wrong calling conventions of finish_automount(). The problem is that
> finish_automount() misuses the primitive intended for mount(2) and
> friends, where we want to mount on top of the pile, even if something
> has managed to add to that while we'd been trying to lock the
> namespace. For automount that's not the right thing to do - there we
> want to discard whatever it was going to attach and just cross into
> what got mounted there in the meanwhile (most likely the results of
> the same automount triggered by somebody else). Current mainline
> kinda-sorta manages to do that, but it's unreliable and very
> convoluted. A much simpler approach is to stop using lock_mount() in
> finish_automount() and have it bail out if something turns out to have
> been mounted on top of where we wanted to attach. That allows us to
> get rid of a lot of PITA in the caller. Another simplification comes
> from not trying to cross into the results of automount - simply ride
> through the next iteration of the loop and let it move into the
> overmount.
>
> Another thing in the same series is divorcing follow_automount()
> from nameidata; that'll come into play later when we get to unifying
> follow_down() with the guts of follow_managed().
>
> 4 commits, the second one fixes a hard-to-hit race. The first
> is a prereq for it.
>
> 1/17 do_add_mount(): lift lock_mount/unlock_mount into callers
> 2/17 fix automount/automount race properly
> 3/17 follow_automount(): get rid of dead^Wstillborn code
> 4/17 follow_automount() doesn't need the entire nameidata
>
> part 2: unifying mount traversals in pathwalk.
>
> Handling of mount traversal (follow_managed()) is currently done
> in a bunch of places. Each of them is shortly followed by a call of
> step_into() or an open-coded equivalent thereof. However, the
> locations of those step_into() calls are far from the preceding
> follow_managed(); moreover, that preceding call might happen on
> different paths that converge to a given step_into() call. It's harder
> to analyse than it should be (especially when it comes to liveness
> analysis) and it forces rather ugly calling conventions on
> lookup_fast()/atomic_open()/lookup_open(). The series below massages
> the code to the point where the calls of follow_managed() (and
> __follow_mount_rcu()) move into the beginning of step_into().
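Just to check my reading of where part 2 ends up, the target shape of
step_into() would be roughly the below. This is only my own sketch from
the description above, not code from the patches; the handle_mounts()
signature and the follow-or-not condition in particular are my guesses.

/* Sketch only - my reading of the target shape, not the actual code
 * from the series; handle_mounts() signature is a guess and the
 * symlink-following condition is over-simplified. */
static int step_into(struct nameidata *nd, int flags,
		     struct dentry *dentry, struct inode *inode, unsigned seq)
{
	struct path path;
	int err;

	/* absorbs what used to be follow_managed()/__follow_mount_rcu()
	 * done by every caller immediately before step_into() */
	err = handle_mounts(nd, dentry, &path, &inode, &seq);
	if (err < 0)
		return err;

	if (likely(!d_is_symlink(path.dentry)) ||
	    !(flags & WALK_FOLLOW) || (flags & WALK_NOFOLLOW)) {
		/* not a symlink, or a symlink we won't follow:
		 * just move nd->path there */
		path_to_nameidata(&path, nd);
		nd->inode = inode;
		nd->seq = seq;
		return 0;
	}
	/* symlink to follow: push it on the stack, caller sees 1 */
	return pick_link(nd, &path, inode, seq);
}

If that's about right it does look a lot easier to reason about than
having each caller do the traversal itself.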
> 5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
>     gets EEXIST handling in do_last() past the step_into() call there.
> 6/17 handle_mounts(): start building a sane wrapper for follow_managed()
>     rather than mangling follow_managed() itself (and creating conflicts
>     with the openat2 series), add a wrapper that will absorb the required
>     interface changes.
> 7/17 atomic_open(): saner calling conventions (return dentry on success)
>     the struct path passed to it is a pure out parameter; only the dentry
>     part ever varies, though - mnt is always nd->path.mnt. Just return
>     the dentry on success, and ERR_PTR(-E...) on failure.
> 8/17 lookup_open(): saner calling conventions (return dentry on success)
>     propagate the same change one level up the call chain.
> 9/17 do_last(): collapse the call of path_to_nameidata()
>     the struct path filled in the lookup_open() call is eventually given
>     to handle_mounts(); the only use it has before that is the
>     path_to_nameidata() call in the "->atomic_open() has actually opened
>     it" case, and there path_to_nameidata() is overkill - we are
>     guaranteed to replace only nd->path.dentry. So have the struct path
>     filled only immediately prior to handle_mounts().
> 10/17 handle_mounts(): pass dentry in, turn path into a pure out argument
>     now all callers of handle_mounts() are directly preceded by filling
>     the struct path it gets. path->mnt is nd->path.mnt in all cases, so
>     we can pass just the dentry instead and fill path in handle_mounts()
>     itself. Some boilerplate gone; path is a pure out argument of
>     handle_mounts() now.
> 11/17 lookup_fast(): consolidate the RCU success case
>     massage to gather what will become the RCU-case equivalent of
>     handle_mounts(); basically, that's what we do if revalidate succeeds
>     in the RCU case of lookup_fast(), including unlazy and fallback to
>     handle_mounts() if __follow_mount_rcu() says "it's too tricky".
> 12/17 teach handle_mounts() to handle RCU mode
>     ... and take that into handle_mounts() itself. The other caller of
>     __follow_mount_rcu() is fine with the same fallback (it just didn't
>     bother since it's at the very beginning of pathwalk), so it is
>     switched to handle_mounts() as well.
> 13/17 lookup_fast(): take mount traversal into callers
>     Now we are getting somewhere - both RCU and non-RCU success cases of
>     lookup_fast() end with the same "return handle_mounts(...);". Move
>     that to the callers - there it will merge with the identical calls
>     that had been on the paths where we had to do slow lookups.
>     lookup_fast() returns a dentry now.
> 14/17 new step_into() flag: WALK_NOFOLLOW
>     use step_into() instead of open-coding it in handle_lookup_down().
>     Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW"
>     for that (and eventually, I hope, for .. handling). Now *all* calls
>     of handle_mounts() and step_into() are right next to each other.
> 15/17 fold handle_mounts() into step_into()
>     ... and we can move the call of handle_mounts() into step_into(),
>     getting slightly saner calling conventions out of that.
> 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
>     another payoff from 14/17 - we can teach path_lookupat() to do what
>     path_mountpointat() used to. And kill the latter, along with its
>     wrappers.
> 17/17 expand the only remaining call of path_lookup_conditional()
>     minor cleanup - only one caller left; RIP path_lookup_conditional().
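Again, to make sure I'm following: after 13/17 and 15/17 the lookup
paths would converge on something roughly like the below. My own sketch,
not taken from the patches; dot handling and the link-stack housekeeping
are omitted and the exact signatures are assumptions on my part.

/* Rough caller shape after the series - sketch only.
 * lookup_fast() now returns a dentry (NULL = cache miss,
 * ERR_PTR() = error) instead of 0/1/-E... plus a filled struct path,
 * and step_into() does the mount traversal itself. */
static int walk_component(struct nameidata *nd, int flags)
{
	struct inode *inode;
	unsigned seq;
	struct dentry *dentry;

	dentry = lookup_fast(nd, &inode, &seq);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);
	if (unlikely(!dentry)) {
		/* cache miss - fall back to the slow path */
		dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
		if (IS_ERR(dentry))
			return PTR_ERR(dentry);
	}
	/* mount crossing and symlink handling both live in here now */
	return step_into(nd, flags, dentry, inode, seq);
}

If that's right, the identical handle_mounts() calls on the fast and
slow paths really do just merge into the one step_into() call.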
> At that point we run out of things that can be done without textual
> conflicts with the openat2 series. Changes so far:
> * mount traversal is taken into step_into().
> * lookup_fast(), atomic_open() and lookup_open() calling conventions
> are slightly changed. All of them return a dentry now, instead of
> returning an int and filling struct path on success. For lookup_fast()
> the old "0 for cache miss, 1 for cache hit" is replaced with "NULL
> stands for cache miss, dentry - for hit".
> * step_into() can be called in RCU mode as well. Takes nameidata,
> WALK_... flags, dentry and, in the RCU case, the corresponding inode
> and seq value. Handles mount traversals, decides whether it's a
> symlink to be followed. Error => returns -E...; symlink to follow =>
> returns 1, puts the symlink on the stack; non-symlink or symlink not
> to follow => returns 0, moves nd->path to the new location.
> * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
> became calls of user_path_at() et al. with LOOKUP_MOUNTPOINT in flags.
>
> Next comes the merge with Aleksa's openat2 patchset; everything up to
> that point had been non-conflicting with it. That patchset has been
> posted earlier; it's in #work.openat2. The next series comes on top of
> the merge.
>
> part 3: untangling the symlink handling.
>
> Right now when we decide to follow a symlink it happens this way:
> * step_into() decides that it has been given a symlink that needs to
> be followed.
> * it calls pick_link(), which pushes the symlink onto the stack and
> returns 1 on success / -E... on error. The symlink's mount/dentry/seq
> is stored on the stack and the inode is stashed in nd->link_inode.
> * step_into() passes that 1 to its callers, which proceed to pass it
> up the call chain for several layers. In all cases we get to a
> get_link() call shortly afterwards.
> * get_link() is called, picks the inode stashed in nd->link_inode by
> pick_link(), does some checks, touches the atime, etc.
> * get_link() either picks the link body out of the inode or calls
> ->get_link(). If it's an absolute symlink, we move to the root and
> return the relative portion of the body; if it's a relative one - just
> return the body. If it's a procfs-style one, the call of
> nd_jump_link() has been made and we've moved to whatever location is
> desired. And we return NULL, same as we do for a symlink to "/".
> * the caller proceeds to deal with the string returned to it.
>
> The sequence is the same in all cases (nested symlink, trailing
> symlink on lookup, trailing symlink on open), but its pieces are not
> close to each other and the bit between the call of pick_link() and
> the (inevitable) call of get_link() afterwards is not easy to follow.
> Moreover, a bunch of functions (walk_component/lookup_last/do_last)
> end up with the same conventions for return values as step_into().
> And those conventions (see above) are not pretty - 0/1/-E... is asking
> for mistakes, especially when the returned 1 is used only to direct
> control flow along a rather twisted path to the matching get_link()
> call. And that path can be seriously twisted. E.g. when we are trying
> to open /dev/stdin, we get the following sequence:
> * path_init() has put us into root and returned "/dev/stdin"
> * link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> * we call do_last(), which sees that we have LAST_NORM and calls
> lookup_fast().
> Let's assume that everything is in dcache; we get the dentry of
> /dev/stdin and proceed to finish_lookup:, where we call step_into()
> * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
> damn thing. Onto the stack it goes and we return 1.
> * do_last() sees 1 and returns it.
> * trailing_symlink() is called (in the top-level loop) and it calls
> get_link(). OK, we get "/proc/self/fd/0" for the body, move to root
> again and return "proc/self/fd/0".
> * link_path_walk() is given that string, eventually leading us into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> * do_last() is called, and similar to the previous case we eventually
> reach the call of step_into() with the dentry of /proc/self/fd/0.
> * _now_ we can discard /dev/stdin from the stack (we'd been using its
> body until now). It's dropped (from step_into()) and we get to look at
> what we'd been given. A symlink to follow, so onto the stack it goes
> and we return 1.
> * again, do_last() passes 1 to the caller
> * trailing_symlink() is called and calls get_link().
> * this time it's a procfs symlink and its ->get_link() method moves
> us to the mount/dentry of our stdin. And returns NULL. But the fun
> doesn't stop yet.
> * trailing_symlink() returns "" to the caller
> * link_path_walk() is called on that and does nothing whatsoever.
> * do_last() is called and sees LAST_BIND left by get_link(). It calls
> handle_dots()
> * handle_dots() drops the symlink from the stack and returns
> * do_last() *FINALLY* proceeds to the point after its call of
> step_into() (finish_open:) and gets around to opening the damn thing.
>
> Making sense of the control flow through all of that is not fun, to
> put it mildly; debugging anything in that area can be a massive PITA,
> and this example has touched only one of 3 cases. Arguably, the worst
> one, but... Anyway, it turns out that this code can be massaged into a
> considerably saner shape - both in terms of control flow and wrt
> calling conventions.
>
> 1/9 merging pick_link() with get_link(), part 1
>     prep work: move the "hardening" crap from trailing_symlink() into
>     get_link() (conditional on the absence of LOOKUP_PARENT in
>     nd->flags). We'll be moving the calls of get_link() around quite a
>     bit through that series, and the next step will be to eliminate
>     trailing_symlink().
> 2/9 merging pick_link() with get_link(), part 2
>     fold trailing_symlink() into lookup_last() and do_last(). Now
>     these return strings; it's not the final calling convention yet,
>     but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...)
>     => old -E..., we'd failed. string => old 1, and the string is the
>     symlink body to follow. Just as for trailing_symlink(), "/" and
>     procfs ones (where get_link() returns NULL) yield "", so the ugly
>     song and dance with a no-op trip through
>     link_path_walk()/handle_dots() still remains.
> 3/9 merging pick_link() with get_link(), part 3
>     elimination of that round-trip. In *all* cases having get_link()
>     return NULL on such symlinks means that we'll proceed to drop the
>     symlink from the stack and get back to the point near that
>     get_link() call - basically, where we would be if it hadn't been a
>     symlink at all. The path by which we are getting there depends
>     upon the call site; the end result is the same in all cases - such
>     symlinks (procfs ones and symlinks to "/") are fully processed by
>     the time get_link() returns, so we might as well drop them from
>     the stack right in get_link().
>     Makes life simpler in terms of control flow analysis... And now
>     the calling conventions for do_last() and lookup_last() have
>     reached the final shape - ERR_PTR(-E...) for error, NULL for "we
>     are done", string for "traverse this".
> 4/9 merging pick_link() with get_link(), part 4
>     now all calls of walk_component() are followed by the same
>     boilerplate - "if it has returned 1, call get_link() and if that
>     has returned NULL treat that as if walk_component() had returned
>     0". Eliminate that by folding it into walk_component() itself. Now
>     the walk_component() return value conventions have joined those of
>     do_last()/lookup_last().
> 5/9 merging pick_link() with get_link(), part 5
>     same as for the previous, only this time the boilerplate migrates
>     one level down, into step_into(). Only one caller of get_link() is
>     left, and step_into() has joined the same return value
>     conventions.
> 6/9 merging pick_link() with get_link(), part 6
>     move that thing into pick_link(). Now all traces of "return 1 if
>     we are following a symlink" are gone.
> 7/9 finally fold get_link() into pick_link()
>     ta-da - expand get_link() into the only caller. As a side benefit,
>     we get rid of stashing the inode in nd->link_inode - it was done
>     only to carry that piece of information from pick_link() to the
>     eventual get_link(). That's not the main benefit, though - the
>     control flow became considerably easier to reason about.
>
> For what it's worth, the example above (/dev/stdin) becomes
> * path_init() has put us into root and returned "/dev/stdin"
> * link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> * we call do_last(), which sees that we have LAST_NORM and calls
> lookup_fast(). Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
> damn thing. Onto the stack it goes and we get its body. Which is
> "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
> * do_last() sees non-NULL and returns it - whether it's an error or a
> pathname to traverse, we haven't reached something we'll be opening.
> * link_path_walk() is given that string, eventually leading us into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> * do_last() is called, and similar to the previous case we eventually
> reach the call of step_into() with the dentry of /proc/self/fd/0.
> * _now_ we can discard /dev/stdin from the stack (we'd been using its
> body until now). It's dropped (from step_into()) and we get to look at
> what we'd been given. A symlink to follow, so onto the stack it goes.
> This time it's a procfs symlink and its ->get_link() method moves us
> to the mount/dentry of our stdin. And returns NULL. So we drop the
> symlink from the stack and return that NULL to the caller.
> * that NULL is returned by step_into(), same as if we had just moved
> to a non-symlink.
> * do_last() proceeds to open the damn thing.
>
> part 4. some mount traversal cleanups.
>
> 8/9 massage __follow_mount_rcu() a bit
>     make it more similar to its non-RCU counterpart
> 9/9 new helper: traverse_mounts()
>     the guts of follow_managed() are very similar to those of
>     follow_down(). The calling conventions are different
>     (follow_managed() works with nameidata, follow_down() - with a
>     standalone struct path), but the core loop is pretty much the same
>     in both.
>     Turned that loop into a common helper (traverse_mounts()) and,
>     since follow_managed() becomes a very thin wrapper around it,
>     expand follow_managed() at its only call site (in handle_mounts()).
>
> That's where the series stands right now. FWIW, at 5.5-rc1 fs/namei.c
> had been 4867 lines, at the tip of #work.openat2 - 4998, at the tip of
> #work.namei (containing #work.openat2) - 4730... And IMO the thing has
> become considerably easier to follow.
>
> What's more, it might be possible to untangle the control flow in
> do_last() now. Probably a separate series, though - do_last() is one
> hell of a tarpit, so I'm not stepping into it for the rest of this
> cycle...
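One more sanity check on my reading of 9/9: the shared loop that both
follow_managed() and follow_down() would use looks, to me, roughly like
the below. Sketch only, to confirm I've understood the description; the
signature, the follow_automount() arguments and the flag handling
details are my own, not taken from the patches.

/* Sketch only - my guess at the shared loop, not the actual patch.
 * The real helper presumably also covers the ref/seq details that
 * differ between the RCU and non-RCU callers. */
static int traverse_mounts(struct path *path, bool *need_mntput)
{
	for (;;) {
		unsigned flags = READ_ONCE(path->dentry->d_flags);

		if (!(flags & DCACHE_MANAGED_DENTRY))
			return 0;

		/* let the filesystem veto or wait on the transit */
		if (flags & DCACHE_MANAGE_TRANSIT) {
			int err = path->dentry->d_op->d_manage(path, false);
			if (err < 0)
				return err;
		}

		/* something mounted on top? cross into it */
		if (flags & DCACHE_MOUNTED) {
			struct vfsmount *mounted = lookup_mnt(path);
			if (mounted) {
				dput(path->dentry);
				if (*need_mntput)
					mntput(path->mnt);
				path->mnt = mounted;
				path->dentry = dget(mounted->mnt_root);
				*need_mntput = true;
				continue;
			}
		}

		/* nothing mounted yet - trigger the automount, then just
		 * ride through the next iteration of the loop to cross
		 * into whatever ended up mounted there */
		if (flags & DCACHE_NEED_AUTOMOUNT) {
			int err = follow_automount(path, need_mntput);
			if (err < 0)
				return err;
			continue;
		}
		return 0;
	}
}

If that's roughly it, having follow_down() and handle_mounts() share it
certainly beats the two slightly different copies we have now.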