Re: [PATCH 51/79] namei: remove restrictions on nesting depth

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Wed, 6 May 2015 17:22:30 +0100

On Wed, May 06, 2015 at 08:24:07AM +0100, Al Viro wrote:
> On Wed, May 06, 2015 at 12:55:28PM +1000, NeilBrown wrote:
> 
> > > -	if (unlikely(current->total_link_count >= 40)) {
> > > +	if (unlikely(current->total_link_count >= MAXSYMLINKS)) {
> > 
> > There is still a literal '40' in follow_automount.
> > 
> >         current->total_link_count++;
> >         if (current->total_link_count >= 40)
> >                 return -ELOOP;
> > 
> > 
> > should that become MAXSYMLINKS too?
> 
> Yes, assuming we want to keep it at all...

To elaborate a bit - the reason why we count those among the symlink crossings
is historical; we used to implement automounting via ->follow_link().

Note that it's not really consistent - once a process has triggered automount,
subsequent lookups crossing that point *won't* have it counted as link
traversal.  Except for ones that come while automount attempt is in progress,
etc.

Limit on link traversals makes obvious sense wrt avoiding loops and DoS -
reaching a symlink in effect is adding a pile of path components to the
path yet to be traversed.  Automount points do nothing of that sort -
the pending path remains as-is.

So I'm not sure it makes sense to have those contribute to the same counter.
OTOH, there _is_ an unpleasant problem around follow_automount() - it's
a place where we call a method with heavy stack footprint still fairly
deep in stack; in mainline it can get as bad as ~2.8K deep, with this tree
the worst call chain entirely inside fs/namei.c is
	SyS_renameat2 -> user_path_parent -> filename_lookup -> path_lookupat ->
	path_init -> link_path_walk -> walk_component -> lookup_fast ->
	follow_managed -> ->d_automount
and it costs 1.1K; slightly more shallow chains lead from linkat(2) and
do_filp_open()/do_file_open_root().  Again, mainline is more than two times
worse on those.

->d_manage() gets hit on the same depth, ->d_revalidate() at a bit less than
that.

FWIW, the sums on a fairly sane amd64 config are
[->d_manage] 1104
[->d_automount] 1104
[->d_revalidate] 1024
[->lookup] 992
[->put_link] 896
[->permission] 832
[->follow_link] 768
[->d_hash] 768
[->d_weak_revalidate] 624
[->create] 544
[->tmpfile] 480
[->atomic_open] 480
[->rename] 336
[->rename2] 336
[->link] 240
[->unlink] 224
[->rmdir] 176
[->mknod] 160
[->symlink] 144
[->mkdir] 144

And ->d_automount() is easily the heaviest of that bunch in terms of stack
footprint.  So much that I really wonder if we ought to use something like
schedule_work() and (interruptibly) wait for completion; the interesting
question here is how much of the initiator's state is needed by worker.
The real namespace isn't - actual mounting is in finish_automount().
Credentials probably are; what about netns?  Suppose a process steps on
NFS4 referral point and triggers a new NFS mount; which netns should be
used - that of the triggering process or that of the NFS mount where we'd
run into the referral?  Ditto for AFS and CIFS...

BTW, if you want to torture the whole thing wrt stack footprint, try to make
/lib/firmware a symlink that would lead you through a referral point (on
the mainline - do so at the maximal nesting depth).  The thing is,
request_firmware() will chase that one *on* *caller's* *stack*.  Seeing that
we do have an async variant anyway (request_firmware_nowait()), could we
have request_firmware() itself go through schedule_work() (and wait for
completion, of course)?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html