On Mon, 27 Apr 2015 20:11:48 +0800 Kinglong Mee <kinglongmee@xxxxxxxxx> wrote:

> On 4/24/2015 11:00 AM, NeilBrown wrote:
> > On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@xxxxxxxxx> wrote:
> >
> >> On 4/23/2015 7:44 AM, NeilBrown wrote:
> >>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:
> >>>>> The reference of dentry/mnt is like a cache that avoids re-finding
> >>>>> them, so it is necessary to store them in svc_export.
> >>>>>
> >>>>> Neil points out another way, 'fs_pin'; I will check that.
> >>>>
> >>>> Yes, that'd be interesting.  On a very quick look--I don't understand
> >>>> what it's meant to do at all.  But if it does provide a way to get a
> >>>> callback on umount, that'd certainly be interesting....
> >>>
> >>> Yeah, on a quick look it isn't really obvious at all.
> >>>
> >>> But I didn't read the code.  I read
> >>>   https://lwn.net/Articles/636730/
> >>>
> >>> which says:
> >>>
> >>>   In its place is a mechanism by which an object can be added to a vfsmount
> >>>   structure (which represents a mounted filesystem); that object supports
> >>>   only one method: kill().  These "pin" objects hang around until the final
> >>>   reference to the vfsmount goes away, at which point each one's kill()
> >>>   function is called.  It thus is a clean mechanism for performing cleanup
> >>>   when a filesystem goes away.
> >>>
> >>> This is used to close "BSD process accounting" files on umount, and sounds
> >>> like the perfect interface for flushing things from the sunrpc cache on
> >>> umount.
> >>
> >> I have reviewed the code in fs/fs_pin.c and kernel/acct.c.
> >> 'fs_pin' is used to fix the race between acct's file->f_path.mnt and
> >> umount: file->f_path.mnt just holds a reference to the vfsmount but is
> >> not meant to keep it busy, yet umount will see the reference and
> >> return busy.
> >>
> >> The logic is the same as for the sunrpc cache of exports.
> >>
> >> Before commit 3064c3563b ("death to mnt_pinned"), acct used a hack:
> >> acct got a pin on the vfsmount and then put the reference, so umount
> >> found no other reference, called back the pins on the vfsmount, and
> >> finally succeeded.
> >>
> >> After that commit, besides pinning the original vfsmount and putting
> >> its reference, acct clones the vfsmount and stores the clone in
> >> file->f_path.mnt.
> >>
> >> The difference between the two methods is the value of file->f_path.mnt:
> >> with the first, it contains the original vfsmount without a reference;
> >> with the second, a cloned vfsmount with a reference.
> >>
> >> I have tested the first method, pinning the vfsmount for all export
> >> cache entries.  nfsd works, but sometimes it delays or hangs at
> >> rpc.mountd's call to name_to_handle_at, and I can't find the reason.
> >
> > A kernel stack trace of exactly where it is hanging would help a lot.
>
> With fs_pin added to the exports, it seems there is a mutex race between
> nfsd and rpc.mountd for locking the parent inode.
>
> Apr 27 19:57:03 ntest kernel: rpc.mountd D ffff88006ac5fc28 0 1152 1 0x00000000
> Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
> Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
> Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
> Apr 27 19:57:03 ntest kernel: Call Trace:
> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
> Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
> Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
> Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
> Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
> Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
> Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
> Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
> Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
> Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
> Apr 27 19:57:03 ntest kernel: nfsd S ffff88006e92b708 0 1170 2 0x00000000
> Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
> Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
> Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
> Apr 27 19:57:03 ntest kernel: Call Trace:
> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
> Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
> Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
> Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
> Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
> Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
> Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
> Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
> Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
>

Hmm... That looks bad.  I wonder why we haven't hit the deadlock before.

nfsd gets a LOOKUP request for $DIR/name which is a mountpoint, or a READDIR
request for $DIR which contains 'name' which is a mountpoint.
nfsd_lookup_dentry or nfsd_buffered_readdir locks i_mutex on $DIR, then does
the mountpoint check.  If the mounted filesystem is not exported, but the
$DIR filesystem has 'crossmnt', an upcall to mountd asks for export
information.  When mountd tries to do a lookup on the name, it works fine if
lookup_fast finds the child, but it deadlocks on the parent's i_mutex if it
has to go to lookup_slow.

It should only go to lookup_slow() if:
 - the child isn't in cache .. but as it is mounted on, it must be
 - d_revalidate returns 0

This probably means that $DIR is on an NFS filesystem and it failed the
revalidation for some reason....

Kinglong: can you reproduce this easily?  If so, can you enable nfs debugging
and collect the output?  I want to confirm that the dfprintk in
nfs_lookup_revalidate is saying that the name is invalid.

In any case, holding a mutex while waiting for an upcall is probably a bad
idea.  We should try to find a way to work around that...

Any ideas Bruce ?-)

NeilBrown
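
To make the lock ordering above concrete, here is a rough userspace analogy
of the deadlock, written with plain pthreads rather than kernel code.  One
thread plays nfsd (it takes the directory lock and then waits for the upcall
reply), the other plays rpc.mountd (it cannot produce the reply without
taking the same lock).  All identifiers in it are invented for illustration;
only the shape of the problem matches the two stacks above.

/*
 * Not kernel code: "dir_lock" stands in for the parent directory's i_mutex,
 * the condition variable for the sunrpc cache upcall that nfsd waits on.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t dir_lock = PTHREAD_MUTEX_INITIALIZER;   /* "i_mutex on $DIR" */
static pthread_mutex_t reply_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  reply_cond = PTHREAD_COND_INITIALIZER;
static int reply_ready;

static void *mountd_thread(void *unused)
{
	(void)unused;

	/* mountd: answering the upcall means looking up the name, and the
	 * lookup_slow path needs the parent's i_mutex, which nfsd holds. */
	pthread_mutex_lock(&dir_lock);          /* blocks here forever */

	pthread_mutex_lock(&reply_lock);
	reply_ready = 1;                        /* never reached */
	pthread_cond_signal(&reply_cond);
	pthread_mutex_unlock(&reply_lock);

	pthread_mutex_unlock(&dir_lock);
	return NULL;
}

int main(void)
{
	pthread_t mountd;

	/* nfsd: lock $DIR for the lookup/readdir, then upcall to mountd for
	 * export information and wait for the reply. */
	pthread_mutex_lock(&dir_lock);
	pthread_create(&mountd, NULL, mountd_thread, NULL);

	pthread_mutex_lock(&reply_lock);
	while (!reply_ready)
		pthread_cond_wait(&reply_cond, &reply_lock);
	pthread_mutex_unlock(&reply_lock);

	pthread_mutex_unlock(&dir_lock);        /* never reached */
	pthread_join(mountd, NULL);
	printf("done\n");
	return 0;
}

Built with "cc -pthread", it stalls by design.  In the real traces the nfsd
side waits with a timeout (cache_wait_req via
wait_for_completion_interruptible_timeout), which is presumably why the
symptom is delays and intermittent hangs rather than a hard deadlock.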
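
For reference, the fs_pin interface quoted from the LWN article earlier in
the thread amounts to embedding a struct fs_pin in your object and giving it
a kill() callback that runs when the last reference to the vfsmount goes
away.  Below is a minimal sketch of how an export-cache entry might hook
into it.  The pinned_export structure and flush_export_entry() are made-up
names, and the init_fs_pin()/pin_insert()/pin_remove() signatures are
assumed from the fs/fs_pin.c of that era, so treat this as an illustration
of the pattern rather than a proposed patch.

/*
 * Hypothetical sketch only: tying an export-cache entry to the life of a
 * vfsmount with fs_pin.  Helper signatures assumed from the 4.0-era tree.
 */
#include <linux/fs_pin.h>
#include <linux/kernel.h>
#include <linux/mount.h>

struct pinned_export {
	struct fs_pin pin;              /* hangs off the vfsmount's pin list */
	/* ... the usual svc_export-style cache fields would live here ... */
};

static void flush_export_entry(struct pinned_export *exp)
{
	/* here the sunrpc cache entry for this mount would be expired */
}

/* Called when the final reference to the vfsmount goes away. */
static void pinned_export_kill(struct fs_pin *pin)
{
	struct pinned_export *exp = container_of(pin, struct pinned_export, pin);

	flush_export_entry(exp);
	pin_remove(pin);                /* signal completion to the umount path */
}

static void pinned_export_attach(struct pinned_export *exp, struct vfsmount *mnt)
{
	init_fs_pin(&exp->pin, pinned_export_kill);
	pin_insert(&exp->pin, mnt);     /* kill() will run at umount time */
}

This mirrors what kernel/acct.c does for BSD process accounting files, the
in-tree user mentioned above: the kill() callback is expected to finish by
calling pin_remove() so that the umount can complete.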