On Jan 4, 2012, at 10:59 AM, Trond Myklebust wrote:

> On Wed, 2012-01-04 at 10:53 -0500, Trond Myklebust wrote:
>> On Tue, 2011-12-06 at 16:13 -0500, Chuck Lever wrote:
>>> Servers have a finite amount of memory to store NFSv4 open and lock
>>> owners. Moreover, servers may have a difficult time determining when
>>> they can reap their state owner table, thanks to gray areas in the
>>> NFSv4 protocol specification. Thus clients should be careful to reuse
>>> state owners when possible.
>>>
>>> Currently Linux is not too careful. When a user has closed all her
>>> files on one mount point, the state owner's reference count goes to
>>> zero, and it is released. The next OPEN allocates a new one. A
>>> workload that serially opens and closes files can run through a large
>>> number of open owners this way.
>>>
>>> When a state owner's reference count goes to zero, slap it onto a free
>>> list for that nfs_server, with an expiry time. Garbage collect before
>>> looking for a state owner. This makes state owners for active users
>>> available for re-use.
>>>
>>> Now that there can be unused state owners remaining at umount time,
>>> purge the state owner free list when a server is destroyed. Also be
>>> sure not to reclaim unused state owners during state recovery.
>>>
>>> This change has benefits for the client as well. For some workloads,
>>> this approach drops the number of OPEN_CONFIRM calls from the same as
>>> the number of OPEN calls, down to just one. This reduces wire traffic
>>> and thus open(2) latency. Before this patch, untarring a kernel
>>> source tarball shows the OPEN_CONFIRM call counter steadily increasing
>>> through the test. With the patch, the OPEN_CONFIRM count remains at 1
>>> throughout the entire untar.
>>>
>>> As long as the expiry time is kept short, I don't think garbage
>>> collection should be terribly expensive, although it does bounce the
>>> clp->cl_lock around a bit.
>>>
>>> [ At some point we should rationalize the use of the nfs_server
>>> ->destroy method. ]
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> ---
>>
>> <snip>
>>
>>> +
>>> +static void nfs4_gc_state_owners(struct nfs_server *server)
>>> +{
>>> +	struct nfs_client *clp = server->nfs_client;
>>> +	struct nfs4_state_owner *sp, *tmp;
>>> +	unsigned long now = jiffies;
>>> +	LIST_HEAD(doomed);
>>> +
>>> +	spin_lock(&clp->cl_lock);
>>> +	list_for_each_entry_safe(sp, tmp, &server->state_owners_lru, so_lru) {
>>> +		/* NB: LRU is sorted so that oldest is at the head */
>>> +		if (time_in_range(now, sp->so_expires,
>>> +				  sp->so_expires + clp->cl_lease_time))
>>> +			break;
>>> +		list_move(&sp->so_lru, &doomed);
>>> +	}
>>> +	spin_unlock(&clp->cl_lock);
>>
>> There is a race here: the state owner is still visible in the rb-tree,
>> and so a second thread that calls nfs4_get_state_owner() may still pick
>> up one of these open owners that are now on your 'doomed' list.
>>
>> I think you rather need to do the equivalent of an
>> 'nfs4_drop_state_owner' in your loop above in order to be 100% safe.
>>
>>> +	list_for_each_entry_safe(sp, tmp, &doomed, so_lru) {
>>> +		list_del_init(&sp->so_lru);
>>> +		nfs4_release_state_owner(sp);
>>> +	}
>>> +}
>>> +
>>> /**
>>>  * nfs4_get_state_owner - Look up a state owner given a credential
>>>  * @server: nfs_server to search
>>> @@ -483,11 +521,13 @@ struct nfs4_state_owner *nfs4_get_state_owner(struct nfs_server *server,
>>> 	struct nfs_client *clp = server->nfs_client;
>>> 	struct nfs4_state_owner *sp, *new;
>>>
>>> +	nfs4_gc_state_owners(server);
>>> +
>
> One more thing: doesn't it make more sense to do the garbage collection
> after you've looked up the state owner? Even if the open owner has been
> expired on the server, you will still gain by avoiding the extra
> allocation.

I think "GC first" is more fair. If we really want to make sure an idle
open owner is not re-used past its sell-by date, we have to do GC first.
Otherwise an open owner can languish for days if there's no other
activity, and will get re-used on the next open. It seems to me that
on-the-wire behavior would be inconsistent if some open owners were
re-used after a long idle period, some were not, and the precise re-use
behavior depended on workload. "GC first" gives nicely predictable
behavior.

>
>>> 	spin_lock(&clp->cl_lock);
>>> 	sp = nfs4_find_state_owner_locked(server, cred);
>>> 	spin_unlock(&clp->cl_lock);
>>> 	if (sp != NULL)
>>> -		return sp;
>>> +		goto out;
>>> 	new = nfs4_alloc_state_owner();
>>> 	if (new == NULL)
>>> 		return NULL;
>>> @@ -502,26 +542,56 @@ struct nfs4_state_owner *nfs4_get_state_owner(struct nfs_server *server,
>>> 		rpc_destroy_wait_queue(&new->so_sequence.wait);
>>> 		kfree(new);
>>> 	}
>>> +out:
>>> +	list_del_init(&sp->so_lru);
>>
>> Ehem... List corruption due to no spin lock. Doesn't it make more sense
>> to do the above inside nfs4_find/insert_state_owner_locked?
>>
>>> 	return sp;
>>> }
>>>
>>
>> Otherwise it looks OK...
>>
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@xxxxxxxxxx
> www.netapp.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com