Re: [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@xxxxxxxxx> wrote:

> On 4/23/2015 7:44 AM, NeilBrown wrote:
> > On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
> > wrote:
> >>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
> >>> it is necessary to store them in svc_export.
> >>>
> >>> Neil points out another way of 'fs_pin', I will check that.
> >>
> >> Yes, that'd be interesting.  On a very quick look--I don't understand
> >> what it's meant to do at all.  But if it does provide a way to get a
> >> callback on umount, that'd certainly be interesting....
> > 
> > Yeah, on a quick look it isn't really obvious at all.
> > 
> > But I didn't read the code.  I read
> >  https://lwn.net/Articles/636730/
> > 
> > which says:
> > 
> >     In its place is a mechanism by which an object can be added to a vfsmount
> >     structure (which represents a mounted filesystem); that object supports 
> >     only one method: kill(). These "pin" objects hang around until the final
> >     reference to the vfsmount goes away, at which point each one's kill() function
> >     is called. It thus is a clean mechanism for performing cleanup when a filesystem
> >     goes away.
> > 
> > This is used to close "BSD process accounting" files on umount, and sound
> > like the perfect interface for flushing things from the sunrpc cache on
> > umount.
> 
> I have review those codes in fs/fs_pin.c and kernel/acct.c.
> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
> and umount, file->f_path.mnt just holds the reference of vfsmount
> but not keep busy, umount will check the reference and return busy.
> 
> The logical is same as sunrpc cache for exports.
> 
> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
> method, acct get a pin to vfsmount and then put the reference,
> so umount finds no other reference and callback those pins in vfsmount,
> at last umount success.
> 
> After that commit, besides pin to original vfsmount and put the reference
> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
> 
> The different between those two methods is the value of file->f_path.mnt,
> the first one, it contains the original vfsmnt without reference,
> the second one, contains a cloned vfsmnt with reference.
> 
> I have test the first method, pins to vfsmount for all exports cache,
> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
> of name_to_handle_at, I can't find the reason.

A kernel stack trace of exactly where it is hanging would help a lot.


> 
> Also test the second method, there are many problem caused by the cloned
> vfsmount, mnt_clone_internal() change mnt_root values to the path dentry.
> 
> Those cases are tested for all exports cache, but, I think nfsd should
> put the reference of vfsmount when finding exports upcall fail.
> The success exports cache should also take it.
> 
> The following is a draft of the first method.

I think the first method sounds a lot better than the second.

> 
> thanks,
> Kinglong Mee
> 
> ----------------------------------------------------------------------
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b540..553e8b1 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
>  	wake_up_locked(&pin->wait);
>  	spin_unlock_irq(&pin->wait.lock);
>  }
> +EXPORT_SYMBOL(pin_remove);
>  
>  void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
>  {
> @@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
>  	hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
>  	spin_unlock(&pin_lock);
>  }
> +EXPORT_SYMBOL(pin_insert_group);
>  
>  void pin_insert(struct fs_pin *pin, struct vfsmount *m)
>  {
>  	pin_insert_group(pin, m, &m->mnt_sb->s_pins);
>  }
> +EXPORT_SYMBOL(pin_insert);
>  
>  void pin_kill(struct fs_pin *p)
>  {
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index c3e3b6e..3e3df8c 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -309,7 +309,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
>  static void svc_export_put(struct kref *ref)
>  {
>  	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
> -	path_put(&exp->ex_path);
> +	pin_remove(&exp->ex_pin);
>  	auth_domain_put(exp->ex_client);
>  	nfsd4_fslocs_free(&exp->ex_fslocs);
>  	kfree(exp->ex_uuid);
> @@ -695,15 +695,26 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
>  		orig->ex_path.mnt == new->ex_path.mnt;
>  }
>  
> +static void export_pin_kill(struct fs_pin *pin)
> +{
> +	struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
> +
> +	write_lock(&exp->cd->hash_lock);
> +	cache_mark_expired(&exp->h);
> +	pin_remove(pin);
> +	write_unlock(&exp->cd->hash_lock);
> +}

I think you need to add some sort of delay here.  The svc_export entry might
still be in active use by an nfsd thread and you need to wait for that to
complete.

I think you need to wait for the refcnt to drop to '1'.  Maybe create a
global wait_queue to wait for that.

Actually... svc_expkey contains a reference to a 'path' too, so you need to
use pinning to purge those when the filesystem is unmounted.

Oh.. and you don't want to call 'pin_remove' from export_pin_kill().  It is
called from svc_export_put() and calling from both could be problematic.

Hmmm...  Ahhhh.

If you get export_pin_kill to only call cache_mark_expired() (maybe move the
locking into that function) and *not* call pin_remove(), then the pin will
remain on the list and ->done will be -1.
So mnt_pin_kill will loop around and this time pin_kill() will wait for
p->done to be set.
So we really need that thing to be removed from cache promptly.  I don't
think we can wait for the next time cache_clean will be run.  We could
call cache_flush, but I think I'd rather remove just that one entry.

Also.... this requires that the pin (and the struct containing it) be freed
using RCU.

So:
 - add an rcu_head to svc_export and svc_expkey
 - use rcu_kfree to free both those objects
 - write a 'cache_force_expire()' which sets the expiry time, and then
   removes the entry from the cache.
 - Use pin_insert_group rather then mntget for both svc_export and svc_expkey
 - have the 'kill' functions for both call cache_force_expire(), but *not*
   pin_remove


> +
>  static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
>  {
>  	struct svc_export *new = container_of(cnew, struct svc_export, h);
>  	struct svc_export *item = container_of(citem, struct svc_export, h);
>  
> +	init_fs_pin(&new->ex_pin, export_pin_kill);
>  	kref_get(&item->ex_client->ref);
>  	new->ex_client = item->ex_client;
>  	new->ex_path = item->ex_path;
> -	path_get(&item->ex_path);
> +	pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL);

This doesn't look right.  path_get does a 'mntget()' and a 'dget()'.
You are replacing 'mntget()' with 'pin_insert_group()', but I think you still
need the dget().

Similarly you need a dput() up where you removed path_put().


Thanks for working on this!

NeilBrown



>  	new->ex_fslocs.locations = NULL;
>  	new->ex_fslocs.locations_count = 0;
>  	new->ex_fslocs.migrated = 0;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index 1f52bfc..718a27e 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -4,6 +4,7 @@
>  #ifndef NFSD_EXPORT_H
>  #define NFSD_EXPORT_H
>  
> +#include <linux/fs_pin.h>
>  #include <linux/sunrpc/cache.h>
>  #include <uapi/linux/nfsd/export.h>
>  
> @@ -49,6 +50,7 @@ struct svc_export {
>  	struct auth_domain *	ex_client;
>  	int			ex_flags;
>  	struct path		ex_path;
> +	struct fs_pin		ex_pin;
>  	kuid_t			ex_anon_uid;
>  	kgid_t			ex_anon_gid;
>  	int			ex_fsid;
> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
> index 437ddb6..6936684 100644
> --- a/include/linux/sunrpc/cache.h
> +++ b/include/linux/sunrpc/cache.h
> @@ -206,6 +206,11 @@ static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea
>  		(detail->flush_time > h->last_refresh);
>  }
>  
> +static inline void cache_mark_expired(struct cache_head *h)
> +{
> +	h->expiry_time = seconds_since_boot() - 1;
> +}
> +
>  extern int cache_check(struct cache_detail *detail,
>  		       struct cache_head *h, struct cache_req *rqstp);
>  extern void cache_flush(void);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Attachment: pgpS97gteUo03.pgp
Description: OpenPGP digital signature


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux