On Mon, May 14, 2018 at 05:47:43PM +0100, Al Viro wrote: > On Mon, May 14, 2018 at 04:45:51PM +0100, Al Viro wrote: > > > > > 2) is nfserr_serverfail valid for situation when > > > > directory created by such vfs_mkdir() manages to disappear > > > > under us immediately afterwards? Or should we return nfserr_stale > > > > instead? > > > > > > We just got a successful result on the create and the parent's still > > > locked, so if somebody hits this I think we want them reporting a bug, > > > not wasting time trying to find a mistake in their own logic. > > > > No. Suppose it's NFS-over-NFS (and yes, I agree that it's a bad idea; > > somebody *has* done that). Lookup after successful mkdir can legitimately > > fail if it's been removed server-side. > > > > And we *will* need to allow nfs_mkdir() to return that way in some cases > > (== success with dentry passed to it left unhashed negative), I'm afraid ;-/ > > Consider the situation when you have NFS reexported - there's server S1, > exporting a filesystem to S2, which reexports it for client C. Thanks for the explanation! I'd missed the connection between this and the mkdir/filehandle-lookup races. Acked-by: J. Bruce Fields <bfields@xxxxxxxxxx> for the patch. --b. > > On S2, process A does stat foo, gets ENOENT and proceeds to mkdir foo. > Dentry of foo is passed to nfs_mkdir(), which calls e.g. nfs3_proc_mkdir(), > which calls nfs3_do_create(), which requests S1 to create the damn thing. > Then, after that succeeds, it calls nfs_instantiate(). There we would > proceed to get the inode and call d_add(dentry, inode). > > In the meanwhile, C, having figured out the fhandle S2 would assign to > foo (e.g. having spied upon the traffic, etc.) sends that fhandle to > S2. nfs_fh_to_dentry() is called by process B on S2 (either knfsd, or, > in case if C==S2 and attacker has done open-by-fhandle - the caller > of open_by_handle(2)). It gets the inode - the damn thing has been > created on S1 already. That inode still has no dentries attached to > it (B has just set it up), so d_obtain_alias() creates one and attaches > to it. > > Now A gets around to nfs_fhget() and finds the same inode. Voila - > d_add(dentry, inode) and we are fucked; two dentries over the same > directory inode. Fun starts very shortly when fs/exportfs/* code > is used by B to reconnect its dentry - the things really get bogus > once that constraint is violated. > > The obvious solution would be to replace that d_add() with > res = d_splice_alias(inode, dentry); > if (IS_ERR(res)) { > error = PTR_ERR(res); > goto out_error; > } > dput(res); > > leaving the dentry passed to nfs_mkdir() unhashed and negative if > we hit that race. That's fine - the next lookup (e.g. the one > done by exportfs to reconnect the sucker) will find the right > dentry in dcache; it's just that it won't the one passed to > vfs_mkdir(). > > That's different from the local case - there mkdir gets the inumber, > inserts locked in-core inode with that inumber into icache and > only then proceeds to set on-disk data up. Anyone guessing the > inumber (and thus the fhandle) will either get -ESTALE (if they > come first, as in this scenario) or find the in-core inode mkdir > is setting up and wait for it to be unlocked, at which point > d_obtain_alias() will already find it attached to dentry passed > to mkdir. > > But that critically relies upon the icache search key being known > to mkdir *BEFORE* the on-disk metadata starts looking acceptable. > For NFS we really can't do that - there the key is S1's fhandle > and we don't get that until S1 has created the damn thing. > > I don't see any variation of the trick used by local filesystems > that would not have this race; we really can end up with B getting > there first and creating directory inode with dentry attached to > it before A gets through. Which, AFAICS, leaves only one solution - > let A put the dentry created by B in place of what had been passed > to A (and leave its argument unhashed). Which is trivial to > implement (see above); however, it means that NFS itself is in > the same class as cgroup - its ->mkdir() may, in some cases, > end up succeeding and leaving its argument unhashed negative. > Note that dcache is perfectly fine after that - we have hashed > positive dentry with the right name and right parent, over the > inode for directory we'd just created; everything's fine, except > that it's not the struct dentry originally passed to vfs_mkdir().