Re: [GIT PULL] Detaching mounts on unlink for 3.15-rc1

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Wed, 09 Apr 2014 02:02:22 -0700

Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> On Wed, Apr 09, 2014 at 03:30:27AM +0100, Al Viro wrote:
>
>> > When renaming or unlinking directory entries that are not mountpoints
>> > no additional locks are taken so no performance differences can result,
>> > and my benchmark reflected that.
>> 
>> It also means that d_invalidate() now might trigger fs shutdown.  Which
>> has bloody huge stack footprint, for obvious reasons.  And d_invalidate()
>> can be called with pretty deep stack - walk into wrong dentry while
>> resolving a deeply nested symlink and there you go...
>
> PS: I thought I actually replied with that point back a month or so ago,
> but having checked sent-mail...  Looks like I had not.  My deep
> apologies.

If I had been aware of the concern I would have addressed this a month
ago.  Oh well that is water under the bridge now.

> FWIW, I think that overall this thing is a good idea, provided that we can
> live with semantics changes.  The implementation is too optimistic, though -
> at the very least, we want this work done upon namespace_unlock() held
> back until we are not too deep in stack.  task_work_add() fodder, perhaps?

Given where we are at in the merge window I need to make the argument
that it makes sense to pull this code now and address the deep
stack concerns before 3.15-final.

As I understand it your concern with d_invalidate is that there will be
a remote filesystem like nfs, and on a directory of that filesystem we
have mounted some other filesystem.  Someone on the nfs server deletes
the directory that is our mountpint.  Sometime later we traverse the
path that leads to the mountpoint in question and call d_invalidate.
The d_invalidate call drops performs a lazy unmount drops the last
filesystem reference and the filesystem umount potentially overflows
the kernel stack.

I would like to argue that because it is stupid to mount something on a
directory that someone else can delete this is not a particularly
likely scenario.  Unfortunately an evil user can mount filesystems
tagged with FS_USERNS_MOUNT and with care can trigger this at will. :(
So if we can overflow the stack this is problem that must be fixed.

The good news is that because it stupid to have a setup where someone
else can delete your mountpoint this won't really be a problem except in
setups with evil users which means merging these changes now, and fixing
this final issue in a few days when a tested patch is available should
not cause any problems.

Looking at the code there are two points where we could make this lazy,
walking the unmounted list in namespace unlock, and simply calling
detach_mount (As dropping a dentry and holid a reference is perfectly
legitimate).   I don't see any fundamental difficulties just some
painstaking detail work, to make certain the resulting implemention is
correct.

Linus would you please merge my branch.  In it's current state the code
fixes a lot of semantic and implementation bugs.  With only the
potential issue of overflowing that stack on a code path that most
people should never trigger.   Having the code merged for 3.15 will free
me to focus on the technical stack depth concerns and allow me to stop
worry about all of the other potential issues the patchset touches on.

Eric

p.s. Now I am off to sleep before I propose a patch to deal with the
potentail of very deep stacks.  I think I can reuse mnt_rcu in struct
mount for the work struct to pass to task_work_add but I need to look
closer at all of the details involved.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html