Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > On Mon, Aug 03, 2015 at 04:27:34PM -0500, Eric W. Biederman wrote: > >> - The escape count on struct mount must be incremented both before the >> rename and after. If the count is not incremented before the rename >> it is possible to hit a scenario where the rename happens the code >> walks up the directory tree to somewhere outside of the bind mount >> before the count is touched. Similary without a count after the >> rename it is possible for the code to look at the escape count >> validate a path is connected before the rename and assume cache the >> escape count, leading to not retesting the path is ok. > > Umm... I wonder if you are overcomplicating the things here. Sure, > I understand wanting to reduce the checks on "..", but... It costs you > considerable complexity (especially when it comes to 64bit counts), > it's really brittle (you need to be very careful about the places where > you zero the cached values in fs/namei.c and missing one will lead to > really unpleasant effects there) _and_ it's all for the benefit of > a very rare case. With check you are optimizing away not being all that > costly anyway. I had to give this a long hard think. Algorithms going to O(N^2) when it is uncessarry really bother me. I ran some numbers for really deep directory trees, slow memory, etc and I could not come up with a scenario where even in it's worst case d_ancestor would take anywhere near as long as a one disk seek, and most of the d_ancestor would be much quicker. So it appears to me that in the worst case a pathname lookup consisting of a ridiculous number of .. components starting with a cold cache, on a mount where a directory has escaped is likely to be faster than a similar lookup going down the tree with many disk seeks. I don't think the 64bit counts and the zeroing the cache values are quite as bad as you make out. There are much trickier things already in path name lookup code. But I do agree that it is easy to get wrong because nothing will show up in testing, and getting it wrong will have really unpleasant effects. I also can't see a scenario where a directory would escape a subtree that is mounted somewhere without it being a misconfiguration. So I agree it is not worth it to optimize the code so that there are an absolute minimum number of d_ancestor calls during pathname lookup. Further replacing mnt_escape_count with a mnt_flag makes the code much simpler. Which I very much appreciate. >> - The largest change is in d_unalias, where the two cases are split >> apart so they can be handled separately. In the easy case of a >> rename within the same directory all that is needed is __d_move >> (escaping a mount is impossible in that case). In the more involved >> case mutexes need to be acquired, and now the spin locks need to be >> dropped so that proper lock aquisition order around __d_move can be >> arranged. > > I _really_ hate that part. Could you explain WTF is wrong with simply > taking mount_lock in that case of __d_splice_alias() just outside of > rename_lock? Me too. So upon realizing the that inode->i_lock is held longer than necessary in d_splice_alias I reworked the locking in d_splice_alias. Updated patches to follow in a little bit. > PS: that thing should be in fs/dcache.c, at least in the part that > deals with finding the common ancestor, etc. And __d_move() (and > dentry_lock_for_move()) games with d_ancestor() should be redundant > now. It does seem reasonable that the BUG_ONs in __d_move that call d_ancestor can be removed, or simplified by passing the common ancestor into __d_move. I don't know the code in dentry_lock_for_move well enough to say anything except that the d_ancestor call in dentry_lock_for_move looks reasonable. Doing anything inside of __d_move or dentry_lock_for_move appears to be a detour to the cause of preventing escaping from bind mounts. So while I have no problems with the with the kinds of changes I hear you suggesting, but unless I encounter something that makes changing __d_move or dentry_lock_for_more relevant to the work of preventing escaping from bind mounts I don't plan on touching them while that is my focus. Eric _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers