Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[ext4 and ocfs2 folks Cc'd]
On Wed, Mar 09, 2016 at 11:47:50PM +0000, Drokin, Oleg wrote:

> Well, it appears there's another way of getting this in addition to
> the already identified "return NULL" at the start of the ll_lookup_nd()
> that was uncovered when reexporting via NFS.
> ll_iget_for_nfs does d_obtain_alias() that might produce a dentry also
> in need of initialization if it's a disconnected one.
> This is a path that you did not list in your possible ways to find a
> dentry in ll_find_alias() before.
> 
> Now, do you think I should just call ll_d_init() on it at all times,
> since that one might be uninitialized.

> BTW, I also did a quick look through the code and ext4 "encryption"
> code potentially got it wrong too? ext4_lookup checks if the parent
> dir have encryption and sets d_fsdata to 1 and also rewrites
> dentry ops with ext4_encrypted_d_ops, but it then does
> d_splice_alias that might find such a disconnected dentry and it would
> lose all of that work in the end?

Umm...  AFAICS, ext4_d_revalidate() is racy, starting with the very
first line.  What's to prevent it being moved while we are calling that?
Lose timeslice on preemption, have mv(1) move it elsewhere, followed by
rmdir taking the now-empty parent out.  Come back and dir points to
freed memory, with ci being complete junk.  Looks like oopsen galore...
Ted, am I missing something subtle here?

d_splice_alias() aside (nfsd interaction will be interesting, though),
this looks like an oopsable race. If not worse - dereferencing some
addresses can really screw hardware...

> ocfs2 reimplements d_splice_alias as ocfs2_find_local_alias where they
> sidestep the initialization of d_fsdata issue (even though they
> refer to it as d_splice_alias elsewhere in the comments).

Not really.  One of the callers (ocfs2_lookup() -> ocfs2_dentry_attach_lock() 
-> ocfs2_find_local_alias()) does that after d_splice_alias(); it doesn't
reimplement d_splice_alias().  There inode is already equal to dentry->d_inode.
In other callers of ocfs2_dentry_attach_lock() we are yet to set ->d_inode,
but already know its value (link/mknod/symlink/reflink).  We do *not* set
it in that function - it's done afterwards (look for d_instantiate()
downstream).  Yet another (cross-directory rename) does that to dentry with
dentry->d_inode already set and equal to inode.  IOW, it's not even close
to d_splice_alias().

> This would cause them to always reject disconnected dentries from
> being found at the very least which is probably somewhat suboptimal
> and leads to some wasted RAM for those extra dentries.

No.  d_splice_alias() in ocfs2_lookup() will pick those just fine.  _Then_
it will go looking for other links in the same directory, and try to share
their (refcounted) ->d_fsdata.  If not found, it'll just alloc a new one -
ocfs2_dentry_lock, not dentry.  Rightfully so, since they are basically
indexed by (parent inumber, child inumber) pairs, so links elsewhere are
out of consideration anyway.

I'm not sure if delaying ->d_fsdata setting until after d_splice_alias() is
such a good idea - it can be picked by dcache lookups as soon as it has
been hashed, so if somebody finds in dcache just as ocfs2_lookup() has
finished d_splice_alias(), we'll end up with ->d_revalidate() called and
possibly finding still NULL ->d_fsdata.  With caller of ->d_revalidate()
unhashing it there and then.  

The same ->d_revalidate() can also cause an unpleasant problem for e.g.
mknod - it'll see ->d_inode still negative, while ->d_fsdata is already
switched from "generation number of parent" to "pointer to ocfs2_dentry_lock",
notice that it does not match the generation number of parent and, again,
kick it out of hash.

Hell knows; I'd probably try to use the LSB of ->d_fsdata to tell one from
another.  ocfs2_dentry_lock is going to be at least 32bit-aligned, so we
could use odd values for ->d_fsdata of negatives.  Always flip to pointer
before making dentry positive and have ->d_revalidate() treat "NULL ->d_inode
and even ->d_fsdata" as "valid negative" - after all, it must have been
just prior to that and dentry *is* negative.

Note, BTW, that d_splice_alias() will not look for aliases in case of
non-directories - for those it's the same d_add(), since there we can
legitimately have many dentry aliases over the same inode.  For directories
we *can't*.

After looking at that code...  ocfs2_match_dentry() is mostly junk.
	a) dentry *never* has NULL ->d_parent; disconnected ones have
->d_parent point to dentry itself.
	b) ->d_parent->d_inode is never NULL.  If it ever happens, we
are really screwed in so many places...
	c) parent_blkno thing is very dubious - if it hadn't just come
from the actual in-core parent inode (and in that case, why not compare
that?), it's pretty much begging for iput() races, _especially_ if they
have async code using it.  For the call chains going through
ocfs2_dentry_attach_lock() the former is true; not sure about the one in
ocfs2_dentry_convert_worker()...

(c) might actually be correct; I hadn't looked in enough details to tell.
(a) and (b) are absolute crap.  BTW, the loop in ocfs2_dentry_convert_worker()
smells like O(busy aliases^2) ;-/

> With this (and the other one) fixes in place I see no other problems
> with your patch.
> 
> So if you think the patch below is ok, I can submit it to Greg
> (with addition of removing the extra init after ll_find_alias call),
> or to you if you think it's better for you to carry it in your tree
> (the one below probably has whitespace all wrong, so it won't really
> apply as is and needs a proper resend).

I'll pick it.  Mind if I fold it into the one I'd posted (with credits,
obviously)?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux