G'day,

This is a pair of patches which attempt to make the sunrpc caching
mechanism more graceful with respect to administrative exportfs changes
from userspace while traffic is flowing to other exports.

The patches are not intended to be applied; they are posted for comment,
to allow someone else to finish this work, and to document the problem.
The patches are incomplete and untested.  They are against the archaic
SLES10 versions of the kernel and nfs-utils and will need some work for
modern software.

Here's a description of the problem.  The current behaviour when an
export is changed (added, removed, or its flags changed) using exportfs
is that exportfs writes a new etab and then tells the kernel to flush the
entire kernel-side export state (including the state for exports and
clients which are not logically affected by the change) and to refresh
everything by triggering upcalls to rpc.mountd.  Most of the time this
works fine, but there's an ugly interaction which can happen when a
client is generating a workload comprising entirely WRITE calls.

From SGI bug PV978592:

> After correlating network and system call traces and staring at a
> bunch of code, I think I have a handle on what's going on.
>
> The scenario is this:
>
> - single client is doing streaming 32K writes and, for a short period,
>   *no other* calls
> - appman-ha runs "exportfs -i"
> - exportfs updates the etab file
> - exportfs stupidly flushes all the nfsd kernel caches (auth.unix.ip,
>   nfsd.fh, and nfsd.export)
> - thus the next WRITE rpc received from the client fails to find a valid
>   object in the kernel auth.unix.ip cache
> - so the kernel starts an upcall to rpc.mountd
> - meanwhile, the rpc needs to be deferred in the kernel for later
>   re-processing when rpc.mountd gets its act together
> - unfortunately, WRITE rpcs (even small 32K ones) exceed the size limit
>   for rpc deferral, so the server silently drops the call, expecting
>   the client to retry
> - back up in userspace, rpc.mountd notices that the etab file has a new
>   mtime and re-reads it, as well as validating it by doing a bunch
>   of lstat() and stat() calls on all the export points and all their
>   ancestor directories.  This is all good and necessary and takes
>   about 3-4 milliseconds.
> - then rpc.mountd stupidly decides to update the rmtab file, which
>   involves writing a few hundred bytes and fdatasync()ing it, taking
>   about 5 milliseconds.
> - rpc.mountd then writes enough data back into the kernel to enable
>   the server to answer NFS rpcs
> - normally this would revisit the deferred rpcs, which would be executed
>   and replies sent to the client, but in this case all the rpcs were
>   dropped so this can't happen
> - the server is now waiting, ready, willing and able to respond to new
>   rpcs from the client
> - unfortunately, because the server was AWOL for all of 8-9 ms, and
>   the client is sending piddly little 32K WRITEs, the client has in
>   the meantime sent quite a few rpcs, and has hit the clientside
>   limit of 16 outstanding rpcs.
> - so the client cannot send more rpcs until the server replies to
>   at least one of the last 16 rpcs...but the server has forgotten
>   all of those rpcs and no replies are forthcoming
> - finally, after about 95 sec the client's rpc timeout triggers
>   and the client retries the 1st of those 16 rpcs
> - the server replies immediately and traffic flows again normally.
>
> End result: running "exportfs -i" to export filesystem A gives a 95 sec
> interruption of NFS service to the unrelated innocent filesystem B, as
> long as all the following are true:
>
> a) traffic comprises only WRITE rpcs
> b) those WRITEs are large enough to defeat rpc deferral but
>    small enough that the rpc rate is relatively high
> c) rpc.mountd updates rmtab in response to an auth.unix.ip upcall
> d) exportfs flushes all the kernel caches
> e) traffic is on TCP (longer timeout than UDP)

This problem had a significant effect on the utility of SGI's HA code.
Several attempts to fix it broke other code in surprising places.

This attempt works as follows.  The kernel keeps an effectively global
generation number (genid).  Interfaces are provided to userspace to allow
querying the genid, and to allow atomically incrementing and then querying
the genid.  After exportfs makes a change to etab it asks the kernel to
increment the genid.  When mountd wants to know whether the etab has
changed, it checks the genid and re-reads etab if the genid has changed
since the last read.  The export updates that mountd writes into the
kernel are tagged with the genid that mountd thinks they belong to, and
this is stored in the cache entry.

Missing is a hunk to make cache_fresh() compare the genids of the entry
and the cache_detail and, if they differ, start an upcall (but *not*
consider the entry invalid, i.e. behave like the age > refresh_age/2
case); a rough sketch of what that might look like follows below the
signature.

I haven't gone through a proper design verification, but I believe this
design will solve the problem of losing WRITE calls when exportfs changes
are made to other exports.  The verification will be complex and messy
because there are lots of racy asynchronous interactions between
exportfs, mountd, the upcall mechanism, and incoming NFS calls.

The new approach still has the problem that every exportfs change causes
a flurry of upcalls, but at least NFS traffic should still flow.

A useful side effect of this approach is that mountd no longer relies on
the mtime of the etab file to determine whether to re-read the file.
With the current code, multiple exportfs changes made in the same second
can be racily lost by mountd, or, worse, by only *some* of several mountd
threads.  This behaviour makes the HA folks unhappy :-(

There are two alternative approaches:

a) allow large NFS calls to be deferred, up to the maximum wsize rather
   than just a page, or

b) change call deferral to always block the calling thread instead of
   using a deferral record and returning -EAGAIN

Both approaches have interesting and potentially frightening side
effects, but could be made to work.  I've discussed option b) with Bruce,
and I understand the NFSv4.1 guys have their own reasons for wanting to
do something like that.

Maybe the above will help explain why the current call deferral behaviour
gives me the irrits :-)

Cc'ing Neil because he and I have discussed this problem.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.
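
To illustrate the intent of the missing cache_fresh() hunk mentioned
above, here is an untested sketch; the helper name and the exact hook
point are guesses and are not part of the patches that follow:

/* Sketch only: a genid mismatch does not invalidate the entry, it just
 * means we want mountd to refresh it with data for the new generation.
 */
static inline int cache_genid_stale(struct cache_detail *cd,
				    struct cache_head *h)
{
	return h->genid != atomic_read(&cd->genid);
}

/* ...then, wherever the upcall decision is made (cache_fresh() /
 * cache_check()), treat a stale genid exactly like a half-aged entry:
 *
 *	if (rv == -EAGAIN ||
 *	    age > refresh_age/2 ||
 *	    cache_genid_stale(detail, h))
 *		queue the upcall, but keep answering from the entry;
 */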
kernel-side code to use the kernel's sunrpc cache genid to update the
kernel export table without potentially causing NFS traffic lossage.

Signed-off-by: Greg Banks <gnb@xxxxxxx>
---

 fs/nfsd/export.c             |   16 ++++
 include/linux/sunrpc/cache.h |    5 +
 net/sunrpc/cache.c           |  108 +++++++++++++++++++++++++++++++-
 3 files changed, 124 insertions(+), 5 deletions(-)

Index: src/fs/nfsd/export.c
===================================================================
--- src.orig/fs/nfsd/export.c
+++ src/fs/nfsd/export.c
@@ -348,7 +348,7 @@ static int check_export(struct inode *in
 static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 {
-	/* client path expiry [flags anonuid anongid fsid] */
+	/* [genid] client path expiry [flags anonuid anongid fsid] */
 	char *buf;
 	int len;
 	int err;
@@ -356,6 +356,7 @@ static int svc_export_parse(struct cache
 	struct nameidata nd;
 	struct svc_export exp, *expp;
 	int an_int;
+	int genid;
 
 	nd.dentry = NULL;
 
@@ -367,11 +368,19 @@ static int svc_export_parse(struct cache
 	err = -ENOMEM;
 	if (!buf) goto out;
 
-	/* client */
+	/* client or genid */
 	len = qword_get(&mesg, buf, PAGE_SIZE);
 	err = -EINVAL;
 	if (len <= 0) goto out;
 
+	genid = cache_parse_genid(buf);
+	if (genid) {
+		/* got genid so next word is the client */
+		len = qword_get(&mesg, buf, PAGE_SIZE);
+		err = -EINVAL;
+		if (len <= 0) goto out;
+	}
+
 	err = -ENOENT;
 	dom = auth_domain_find(buf);
 	if (!dom)
@@ -385,6 +394,7 @@ static int svc_export_parse(struct cache
 	if (err) goto out_no_path;
 
 	exp.h.flags = 0;
+	exp.h.genid = genid;
 	exp.ex_client = dom;
 	exp.ex_mnt = nd.mnt;
 	exp.ex_dentry = nd.dentry;
@@ -515,6 +525,7 @@ static inline void svc_export_init(struc
 	new->ex_dentry = dget(item->ex_dentry);
 	new->ex_mnt = mntget(item->ex_mnt);
 	new->ex_stats = exp_stats_find(new->ex_mnt, new->ex_dentry);
+	new->h.genid = item->h.genid;
 }
 
 static inline void svc_export_update(struct svc_export *new, struct svc_export *item)
@@ -523,6 +534,7 @@ static inline void svc_export_update(str
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
 	new->ex_fsid = item->ex_fsid;
+	new->h.genid = item->h.genid;
 }
 
 static DefineSimpleCacheLookup(svc_export,1) /* allow inplace updates */
Index: src/include/linux/sunrpc/cache.h
===================================================================
--- src.orig/include/linux/sunrpc/cache.h
+++ src/include/linux/sunrpc/cache.h
@@ -51,6 +51,7 @@ struct cache_head {
 	time_t		last_refresh;	/* If CACHE_PENDING, this is when upcall
 					 * was sent, else this is when update was received */
+	int		genid;		/* generation at which this entry is valid */
 	atomic_t	refcnt;
 	unsigned long	flags;
 };
@@ -87,6 +88,7 @@ struct cache_detail {
 	 */
 	time_t		flush_time;	/* flush all cache items with last_refresh
					 * earlier than this */
+	atomic_t	genid;		/* current generation */
 	struct list_head others;
 	time_t		nextcheck;
 	int		entries;
@@ -99,6 +101,7 @@ struct cache_detail {
 	struct proc_dir_entry	*proc_ent;
 	struct proc_dir_entry	*flush_ent, *channel_ent, *content_ent;
+	struct proc_dir_entry	*currgen_ent, *nextgen_ent;
 
 	atomic_t		readers;	/* how many time is /chennel open */
 	time_t			last_close;	/* if no readers, when did last close */
@@ -291,6 +294,8 @@ extern void cache_purge(struct cache_det
 #define NEVER (0x7FFFFFFF)
 extern void cache_register(struct cache_detail *cd);
 extern int cache_unregister(struct cache_detail *cd);
+extern int cache_parse_genid(const char *buf);
+extern int cache_format_genid(char *buf, int maxlen, int genid);
 
 extern void qword_add(char **bpp, int *lp, char *str);
 extern void qword_addhex(char **bpp, int *lp, char *buf, int blen);
Index: src/net/sunrpc/cache.c
===================================================================
--- src.orig/net/sunrpc/cache.c
+++ src/net/sunrpc/cache.c
@@ -47,6 +47,7 @@ void cache_init(struct cache_head *h)
 	atomic_set(&h->refcnt, 1);
 	h->expiry_time = now + CACHE_NEW_EXPIRY;
 	h->last_refresh = now;
+	h->genid = 0;
 }
 
 
@@ -168,6 +169,8 @@ static int current_index;
 static struct file_operations cache_file_operations;
 static struct file_operations content_file_operations;
 static struct file_operations cache_flush_operations;
+static struct file_operations currgen_file_operations;
+static struct file_operations nextgen_file_operations;
 
 static DECLARE_WORK(cache_cleaner, do_cache_clean, NULL);
 
@@ -197,6 +200,26 @@ void cache_register(struct cache_detail
 			p->owner = cd->owner;
 			p->data = cd;
 		}
+
+		p = create_proc_entry("currgen", S_IFREG|S_IRUSR,
+				      cd->proc_ent);
+		cd->currgen_ent = p;
+		if (p) {
+			p->proc_fops = &currgen_file_operations;
+			p->owner = cd->owner;
+			p->data = cd;
+
+		}
+
+		p = create_proc_entry("nextgen", S_IFREG|S_IRUSR,
+				      cd->proc_ent);
+		cd->nextgen_ent = p;
+		if (p) {
+			p->proc_fops = &nextgen_file_operations;
+			p->owner = cd->owner;
+			p->data = cd;
+
+		}
 	}
 	if (cd->cache_show) {
 		p = create_proc_entry("content", S_IFREG|S_IRUSR|S_IWUSR,
@@ -216,6 +239,7 @@ void cache_register(struct cache_detail
 	INIT_LIST_HEAD(&cd->to_write);
 	spin_lock(&cache_list_lock);
 	cd->nextcheck = 0;
+	atomic_set(&cd->genid, 0);
 	cd->entries = 0;
 	atomic_set(&cd->readers, 0);
 	cd->last_close = 0;
@@ -259,6 +283,10 @@ int cache_unregister(struct cache_detail
 			remove_proc_entry("channel", cd->proc_ent);
 		if (cd->content_ent)
 			remove_proc_entry("content", cd->proc_ent);
+		if (cd->currgen_ent)
+			remove_proc_entry("currgen", cd->proc_ent);
+		if (cd->nextgen_ent)
+			remove_proc_entry("nextgen", cd->proc_ent);
 		cd->proc_ent = NULL;
 		remove_proc_entry(cd->name, proc_net_rpc);
 
@@ -853,6 +881,23 @@ void qword_addhex(char **bpp, int *lp, c
 	*lp = len;
 }
 
+int cache_parse_genid(const char *buf)
+{
+	int genid;
+
+	if (buf[10] != '\0' ||
+	    sscanf(buf, "#%08x#", &genid) != 1)
+		return 0;
+	return genid;
+}
+EXPORT_SYMBOL(cache_parse_genid);
+
+int cache_format_genid(char *buf, int maxlen, int genid)
+{
+	return snprintf(buf, maxlen, "#%08x#", genid);
+}
+EXPORT_SYMBOL(cache_format_genid);
+
 static void warn_no_listener(struct cache_detail *cd)
 {
 	if (cd->last_warn != cd->last_close) {
@@ -1043,9 +1088,12 @@ static int c_show(struct seq_file *m, vo
 	if (p == SEQ_START_TOKEN)
 		return cd->cache_show(m, cd, NULL);
 
-	if (unlikely(debugcheck(CACHE)))
-		seq_printf(m, "# expiry=%ld refcnt=%d\n",
-			   h->expiry_time, atomic_read(&h->refcnt));
+	if (unlikely(debugcheck(CACHE))) {
+		char genid_buf[20];
+		cache_format_genid(genid_buf, sizeof(genid_buf), h->genid);
+		seq_printf(m, "# expiry=%ld refcnt=%d genid=%s\n",
+			   h->expiry_time, atomic_read(&h->refcnt), genid_buf);
+	}
 	cache_get(h);
 	if (cache_check(cd, h, NULL))
 		/* cache_check does a cache_put on failure */
@@ -1149,3 +1197,57 @@ static struct file_operations cache_flus
 	.read		= read_flush,
 	.write		= write_flush,
 };
+
+static ssize_t
+genid_do_read(struct cache_detail *cd, char __user *buf, size_t count,
+	      int increment)
+{
+	int len;
+	int val;
+	char tmp[12];
+
+	if (count == 0)
+		return 0;
+
+	if (increment)
+		val = atomic_inc_return(&cd->genid);
+	else
+		val = atomic_read(&cd->genid);
+	if (!val)
+		return 0;
+
+	len = cache_format_genid(tmp, sizeof(tmp), val);
+
+	if (count < len)
+		return -EFAULT;
+	if (copy_to_user(buf, tmp, len))
+		return -EFAULT;
+
+	return len;
+}
+
+static ssize_t
+currgen_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct cache_detail *cd = PDE(filp->f_dentry->d_inode)->data;
+	return genid_do_read(cd, buf, count, 0);
+}
+
+static struct file_operations currgen_file_operations = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= currgen_read
+};
+
+static ssize_t
+nextgen_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct cache_detail *cd = PDE(filp->f_dentry->d_inode)->data;
+	return genid_do_read(cd, buf, count, 1);
+}
+
+static struct file_operations nextgen_file_operations = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= nextgen_read
+};
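
For clarity, with this change an export update written into the
nfsd.export channel may now be prefixed with a genid token in the
cache_format_genid() format, for example (the client, path and expiry
values below are made up):

	#0000002a# nfs-client.example.com /export/data 1240000000

The remaining optional fields (flags, anonuid, anongid, fsid) follow as
before, and an update with no leading genid token is parsed exactly as it
is today, with the entry's genid left at zero.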
Userspace (mountd and exportfs) code to use the kernel's sunrpc cache
genid to update the kernel export table without potentially causing NFS
traffic lossage.

Signed-off-by: Greg Banks <gnb@xxxxxxx>
---

 support/export/xtab.c    |    2 
 support/include/nfslib.h |    5 +
 support/nfs/cacheio.c    |   98 ++++++++++++++++++++++++++++--------
 utils/mountd/auth.c      |   56 +++++++++++++++++++-
 utils/mountd/cache.c     |    5 +
 utils/mountd/mountd.h    |    1 
 6 files changed, 145 insertions(+), 22 deletions(-)

Index: nfs-utils-1.0.7.vanilla/support/export/xtab.c
===================================================================
--- nfs-utils-1.0.7.vanilla.orig/support/export/xtab.c
+++ nfs-utils-1.0.7.vanilla/support/export/xtab.c
@@ -110,6 +110,8 @@ xtab_write(char *xtab, char *xtabtmp, in
 	endexportent();
 
 	cond_rename(xtabtmp, xtab);
+	if (is_export)
+		cache_bump_genid("nfsd.export");
 
 	xfunlock(lockid);
 
Index: nfs-utils-1.0.7.vanilla/utils/mountd/auth.c
===================================================================
--- nfs-utils-1.0.7.vanilla.orig/utils/mountd/auth.c
+++ nfs-utils-1.0.7.vanilla/utils/mountd/auth.c
@@ -44,8 +44,45 @@ auth_init(char *exports)
 	xtab_mount_write();
 }
 
-int
-auth_reload()
+int auth_genid = 0;
+
+static int
+auth_genid_newer(void)
+{
+	static int fd = -2;
+	int len;
+	int genid;
+	char tmp[CACHE_GENID_LENGTH];
+
+	if (fd == -2) {
+		fd = open("/proc/net/rpc/nfsd.export/currgen", O_RDONLY, 0);
+		if (fd < 0)
+			fd = -1;
+		else
+			xlog(L_INFO, "Found currgen file, using genids\n");
+	}
+	if (fd < 0)
+		return -1;
+	len = read(fd, tmp, sizeof(tmp)-1);
+	if (len < 0)
+		return -1;
+	tmp[len] = '\0';
+	if ((genid = cache_parse_genid(tmp)) < 0)
+		return -1;
+	if (genid == auth_genid)
+		return 0;
+	{
+		char old[CACHE_GENID_LENGTH], new[CACHE_GENID_LENGTH];
+		cache_format_genid(old, sizeof(old), auth_genid);
+		cache_format_genid(new, sizeof(new), genid);
+		xlog(L_INFO, "genid bumped from %s to %s\n", old, new);
+	}
+	auth_genid = genid;
+	return 1;
+}
+
+static int
+auth_etab_newer(void)
 {
 	struct stat stb;
 	static time_t last_modified = 0;
@@ -54,7 +91,21 @@ auth_reload()
 		xlog(L_FATAL, "couldn't stat %s", _PATH_ETAB);
 	if (stb.st_mtime == last_modified)
 		return 0;
+	xlog(L_INFO, "Etab mtime bumped from %ld to %ld\n",
+	     last_modified, stb.st_mtime);
 	last_modified = stb.st_mtime;
+	return 1;
+}
+
+
+int
+auth_reload()
+{
+	int newer = auth_genid_newer();
+	if (newer < 0)
+		newer = auth_etab_newer();
+	if (!newer)
+		return 0;
 
 	export_freeall();
 	memset(&my_client, 0, sizeof(my_client));
@@ -223,3 +274,4 @@ auth_fixpath(char *path)
 		cp--;
 	*cp = '\0';
 }
+
Index: nfs-utils-1.0.7.vanilla/utils/mountd/cache.c
===================================================================
--- nfs-utils-1.0.7.vanilla.orig/utils/mountd/cache.c
+++ nfs-utils-1.0.7.vanilla/utils/mountd/cache.c
@@ -282,6 +282,11 @@ void nfsd_export(FILE *f)
 		}
 	}
 
+	if (auth_genid) {
+		char buf[CACHE_GENID_LENGTH];
+		cache_format_genid(buf, sizeof(buf), auth_genid);
+		qword_print(f, buf);
+	}
 	qword_print(f, dom);
 	qword_print(f, path);
 	qword_printint(f, time(0)+30*60);
Index: nfs-utils-1.0.7.vanilla/utils/mountd/mountd.h
===================================================================
--- nfs-utils-1.0.7.vanilla.orig/utils/mountd/mountd.h
+++ nfs-utils-1.0.7.vanilla/utils/mountd/mountd.h
@@ -41,6 +41,7 @@ bool_t	mount_mnt_3_svc(struct svc_req *
 void	mount_dispatch(struct svc_req *, SVCXPRT *);
 void	auth_init(char *export_file);
 int	auth_reload(void);
+extern int auth_genid;
 nfs_export *auth_authenticate(char *what, struct sockaddr_in *sin, char *path);
 void	auth_export(nfs_export *exp);
Index: nfs-utils-1.0.7.vanilla/support/nfs/cacheio.c
===================================================================
--- nfs-utils-1.0.7.vanilla.orig/support/nfs/cacheio.c
+++ nfs-utils-1.0.7.vanilla/support/nfs/cacheio.c
@@ -211,42 +211,39 @@ int qword_get_int(char **bpp, int *anint
 	return 0;
 }
 
-#define READLINE_BUFFER_INCREMENT 2048
+#define READLINE_BUFFER PAGE_SIZE
 
 int readline(int fd, char **buf, int *lenp)
 {
 	/* read a line into *buf, which is malloced *len long
-	 * realloc if needed until we find a \n
+	 *
+	 * We do a single read which is as large as the largest
+	 * message the kernel will send us, to avoid unnecessary
+	 * drama with atomicity of partial reads in a multi-
+	 * process mountd.
+	 *
 	 * nul out the \n and return
 	 * 0 on eof, 1 on success
 	 */
	int len;

	if (*lenp == 0) {
-		char *b = malloc(READLINE_BUFFER_INCREMENT);
+		char *b = malloc(READLINE_BUFFER);
		if (b == NULL)
			return 0;
		*buf = b;
-		*lenp = READLINE_BUFFER_INCREMENT;
+		*lenp = READLINE_BUFFER;
	}
	len = read(fd, *buf, *lenp);
-	if (len <= 0)
+	if (len == 0)
+		return 0;
+	if (len < 0) {
+		xlog(L_ERROR, "read() from cache failed: %s", strerror(errno));
+		return 0;
+	}
+	if ((*buf)[len-1] != '\n') {
+		xlog(L_ERROR, "couldn't read whole message after %d bytes", len);
		return 0;
-	while ((*buf)[len-1] != '\n') {
-		/* now the less common case.  There was no newline,
-		 * so we have to keep reading after re-alloc
-		 */
-		char *new;
-		int nl;
-		*lenp += READLINE_BUFFER_INCREMENT;
-		new = realloc(*buf, *lenp);
-		if (new == NULL)
-			return 0;
-		*buf = new;
-		nl = read(fd, *buf + len, *lenp - len);
-		if (nl <= 0)
-			return 0;
-		len += nl;
	}
	(*buf)[len-1] = '\0';
	return 1;
@@ -310,3 +307,64 @@ cache_flush(int force)
 		}
 	}
 }
+
+int
+cache_parse_genid(const char *buf)
+{
+	int genid = 0;
+	if (buf[10] == '\0' && sscanf(buf, "#%08x#", &genid) == 1)
+		return genid;
+	return -1;
+}
+
+void
+cache_format_genid(char *buf, size_t maxlen, int genid)
+{
+	snprintf(buf, maxlen, "#%08x#", genid);
+}
+
+static int
+cache_read_genid(const char *cachename, int bump)
+{
+	int fd;
+	int len;
+	int genid;
+	char tmp[CACHE_GENID_LENGTH];
+	char path[200];
+
+	snprintf(path, sizeof(path), "/proc/net/rpc/%s/%s",
+		 cachename, (bump ? "nextgen" : "currgen"));
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		return -1;
+
+	if (bump)
+		xlog(L_INFO, "Found %s file, bumping genid", path);
+
+	len = read(fd, tmp, sizeof(tmp)-1);
+	close(fd);
+
+	if (len < 0)
+		return -1;
+	tmp[len] = '\0';
+	genid = cache_parse_genid(tmp);
+	if (genid < 0)
+		return -1;
+	if (bump)
+		xlog(L_INFO, "genid bumped to %s\n", tmp);
+	else
+		xlog(L_INFO, "read genid %s\n", tmp);
+	return genid;
+}
+
+int
+cache_get_genid(const char *cachename)
+{
+	return cache_read_genid(cachename, /*bump*/0);
+}
+
+int
+cache_bump_genid(const char *cachename)
+{
+	return cache_read_genid(cachename, /*bump*/1);
+}
Index: nfs-utils-1.0.7.vanilla/support/include/nfslib.h
===================================================================
--- nfs-utils-1.0.7.vanilla.orig/support/include/nfslib.h
+++ nfs-utils-1.0.7.vanilla/support/include/nfslib.h
@@ -135,6 +135,11 @@ int readline(int fd, char **buf, int *le
 int qword_get(char **bpp, char *dest, int bufsize);
 int qword_get_int(char **bpp, int *anint);
 void cache_flush(int force);
+#define CACHE_GENID_LENGTH 12
+int cache_parse_genid(const char *buf);
+int cache_get_genid(const char *cachename);
+int cache_bump_genid(const char *cachename);
+void cache_format_genid(char *buf, size_t maxlen, int genid);
 int check_new_cache(void);
 void qword_add(char **bpp, int *lp, char *str);
 void qword_addhex(char **bpp, int *lp, char *buf, int blen);