On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@xxxxxxxxx> wrote: > On 4/23/2015 7:44 AM, NeilBrown wrote: > > On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@xxxxxxxxxxxx> > > wrote: > >>> Reference of dentry/mnt is like a cache, avoids re-finding of them, > >>> it is necessary to store them in svc_export. > >>> > >>> Neil points out another way of 'fs_pin', I will check that. > >> > >> Yes, that'd be interesting. On a very quick look--I don't understand > >> what it's meant to do at all. But if it does provide a way to get a > >> callback on umount, that'd certainly be interesting.... > > > > Yeah, on a quick look it isn't really obvious at all. > > > > But I didn't read the code. I read > > https://lwn.net/Articles/636730/ > > > > which says: > > > > In its place is a mechanism by which an object can be added to a vfsmount > > structure (which represents a mounted filesystem); that object supports > > only one method: kill(). These "pin" objects hang around until the final > > reference to the vfsmount goes away, at which point each one's kill() function > > is called. It thus is a clean mechanism for performing cleanup when a filesystem > > goes away. > > > > This is used to close "BSD process accounting" files on umount, and sound > > like the perfect interface for flushing things from the sunrpc cache on > > umount. > > I have review those codes in fs/fs_pin.c and kernel/acct.c. > 'fs_pin' is used to fix the race between file->f_path.mnt for acct > and umount, file->f_path.mnt just holds the reference of vfsmount > but not keep busy, umount will check the reference and return busy. > > The logical is same as sunrpc cache for exports. > > Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack > method, acct get a pin to vfsmount and then put the reference, > so umount finds no other reference and callback those pins in vfsmount, > at last umount success. > > After that commit, besides pin to original vfsmount and put the reference > of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt. > > The different between those two methods is the value of file->f_path.mnt, > the first one, it contains the original vfsmnt without reference, > the second one, contains a cloned vfsmnt with reference. > > I have test the first method, pins to vfsmount for all exports cache, > nfsd can work but sometimes will delay or hang at rpc.mountd's calling > of name_to_handle_at, I can't find the reason. A kernel stack trace of exactly where it is hanging would help a lot. > > Also test the second method, there are many problem caused by the cloned > vfsmount, mnt_clone_internal() change mnt_root values to the path dentry. > > Those cases are tested for all exports cache, but, I think nfsd should > put the reference of vfsmount when finding exports upcall fail. > The success exports cache should also take it. > > The following is a draft of the first method. I think the first method sounds a lot better than the second. > > thanks, > Kinglong Mee > > ---------------------------------------------------------------------- > diff --git a/fs/fs_pin.c b/fs/fs_pin.c > index 611b540..553e8b1 100644 > --- a/fs/fs_pin.c > +++ b/fs/fs_pin.c > @@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin) > wake_up_locked(&pin->wait); > spin_unlock_irq(&pin->wait.lock); > } > +EXPORT_SYMBOL(pin_remove); > > void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p) > { > @@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head > hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins); > spin_unlock(&pin_lock); > } > +EXPORT_SYMBOL(pin_insert_group); > > void pin_insert(struct fs_pin *pin, struct vfsmount *m) > { > pin_insert_group(pin, m, &m->mnt_sb->s_pins); > } > +EXPORT_SYMBOL(pin_insert); > > void pin_kill(struct fs_pin *p) > { > diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c > index c3e3b6e..3e3df8c 100644 > --- a/fs/nfsd/export.c > +++ b/fs/nfsd/export.c > @@ -309,7 +309,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc) > static void svc_export_put(struct kref *ref) > { > struct svc_export *exp = container_of(ref, struct svc_export, h.ref); > - path_put(&exp->ex_path); > + pin_remove(&exp->ex_pin); > auth_domain_put(exp->ex_client); > nfsd4_fslocs_free(&exp->ex_fslocs); > kfree(exp->ex_uuid); > @@ -695,15 +695,26 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b) > orig->ex_path.mnt == new->ex_path.mnt; > } > > +static void export_pin_kill(struct fs_pin *pin) > +{ > + struct svc_export *exp = container_of(pin, struct svc_export, ex_pin); > + > + write_lock(&exp->cd->hash_lock); > + cache_mark_expired(&exp->h); > + pin_remove(pin); > + write_unlock(&exp->cd->hash_lock); > +} I think you need to add some sort of delay here. The svc_export entry might still be in active use by an nfsd thread and you need to wait for that to complete. I think you need to wait for the refcnt to drop to '1'. Maybe create a global wait_queue to wait for that. Actually... svc_expkey contains a reference to a 'path' too, so you need to use pinning to purge those when the filesystem is unmounted. Oh.. and you don't want to call 'pin_remove' from export_pin_kill(). It is called from svc_export_put() and calling from both could be problematic. Hmmm... Ahhhh. If you get export_pin_kill to only call cache_mark_expired() (maybe move the locking into that function) and *not* call pin_remove(), then the pin will remain on the list and ->done will be -1. So mnt_pin_kill will loop around and this time pin_kill() will wait for p->done to be set. So we really need that thing to be removed from cache promptly. I don't think we can wait for the next time cache_clean will be run. We could call cache_flush, but I think I'd rather remove just that one entry. Also.... this requires that the pin (and the struct containing it) be freed using RCU. So: - add an rcu_head to svc_export and svc_expkey - use rcu_kfree to free both those objects - write a 'cache_force_expire()' which sets the expiry time, and then removes the entry from the cache. - Use pin_insert_group rather then mntget for both svc_export and svc_expkey - have the 'kill' functions for both call cache_force_expire(), but *not* pin_remove > + > static void svc_export_init(struct cache_head *cnew, struct cache_head *citem) > { > struct svc_export *new = container_of(cnew, struct svc_export, h); > struct svc_export *item = container_of(citem, struct svc_export, h); > > + init_fs_pin(&new->ex_pin, export_pin_kill); > kref_get(&item->ex_client->ref); > new->ex_client = item->ex_client; > new->ex_path = item->ex_path; > - path_get(&item->ex_path); > + pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL); This doesn't look right. path_get does a 'mntget()' and a 'dget()'. You are replacing 'mntget()' with 'pin_insert_group()', but I think you still need the dget(). Similarly you need a dput() up where you removed path_put(). Thanks for working on this! NeilBrown > new->ex_fslocs.locations = NULL; > new->ex_fslocs.locations_count = 0; > new->ex_fslocs.migrated = 0; > diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h > index 1f52bfc..718a27e 100644 > --- a/fs/nfsd/export.h > +++ b/fs/nfsd/export.h > @@ -4,6 +4,7 @@ > #ifndef NFSD_EXPORT_H > #define NFSD_EXPORT_H > > +#include <linux/fs_pin.h> > #include <linux/sunrpc/cache.h> > #include <uapi/linux/nfsd/export.h> > > @@ -49,6 +50,7 @@ struct svc_export { > struct auth_domain * ex_client; > int ex_flags; > struct path ex_path; > + struct fs_pin ex_pin; > kuid_t ex_anon_uid; > kgid_t ex_anon_gid; > int ex_fsid; > diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h > index 437ddb6..6936684 100644 > --- a/include/linux/sunrpc/cache.h > +++ b/include/linux/sunrpc/cache.h > @@ -206,6 +206,11 @@ static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea > (detail->flush_time > h->last_refresh); > } > > +static inline void cache_mark_expired(struct cache_head *h) > +{ > + h->expiry_time = seconds_since_boot() - 1; > +} > + > extern int cache_check(struct cache_detail *detail, > struct cache_head *h, struct cache_req *rqstp); > extern void cache_flush(void); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html
Attachment:
pgpS97gteUo03.pgp
Description: OpenPGP digital signature