+Trond, Chuck, Anna Ping... > -----Original Message----- > From: Robert Milkowski <rmilkowski@xxxxxxxxx> > Sent: 05 May 2020 18:35 > To: linux-nfs@xxxxxxxxxxxxxxx > Subject: NFS v4 + kerberos: 4 minute window of slowness > > Hi, > > Currently the last 4 minutes before kernel gss context expires, all > writes to NFSv4 are synchronous and all dirty pages associated with the > file being written to are being destaged. > This will continue for the 4 minutes until the context expires, at which > point it gets refreshed and everything gets back to normal. > > The rpc.gssd by default sets the timeout to match the Kerberos service > ticket, but this can be lowered by using -t option. > In fact many sites set it to lower value, like for example 30 minutes. > This means that every 30 minutes, the last 4 minutes results in severely > slower writes (assuming these are buffered - no O_DSYNC, etc.). > > In extreme case, when one sets the timeout to 5 minutes, during the 4 > minutes out of the minutes, there will be the slowness observed. > > > I understand the idea behind this mechanism - I guess it tries to avoid > situation when a gss context can't be refreshed (due to error or account > disabled, etc.), and it expires suddenly nfs client wouldn't be able to > destage all the buffered writes. The way it is currently implemented > though is rather crude. > In my opinion, instead of making everything slow for the whole 4 > minutes, it should first try to refresh the context immediately and if > successful things go back to normal, if it can't refresh the context > then it should continue with the previous one and revert to the current > behaviour. I implemented a naïve quick fix which does exactly that > (attached at the end of this email). > > > How to re-produce. > > > $ uname -r > 5.7.0-rc4+ > > $ grep -- -t /etc/sysconfig/nfs > RPCGSSDARGS="-t 300" > > I'm setting it to 5 minutes so I can quickly see the behaviour without > having to wait for too long. > > > Now, let's generate a small write every 10s to a file on nfsv4,sec=krb5 > filesystem and record how long each write takes. > Since these are buffered writes it should be very quick most of the > time. > > $ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa >f1; rm > f1; sleep 10; done > > 15:22:41 write(1, "aa\n", 3) = 3 <0.000108> > 15:22:51 write(1, "aa\n", 3) = 3 <0.000113> > 15:23:01 write(1, "aa\n", 3) = 3 <0.000111> > 15:23:11 write(1, "aa\n", 3) = 3 <0.000112> > 15:23:21 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< > becomes > slow > 15:23:31 write(1, "aa\n", 3) = 3 <0.001622> > 15:23:41 write(1, "aa\n", 3) = 3 <0.001553> > 15:23:51 write(1, "aa\n", 3) = 3 <0.001495> > ... > 15:27:01 write(1, "aa\n", 3) = 3 <0.001528> > 15:27:12 write(1, "aa\n", 3) = 3 <0.001553> > 15:27:22 write(1, "aa\n", 3) = 3 <0.000104> <<<<<< > becomes > fast again > 15:27:32 write(1, "aa\n", 3) = 3 <0.000125> > 15:27:42 write(1, "aa\n", 3) = 3 <0.000129> > 15:27:52 write(1, "aa\n", 3) = 3 <0.000113> > 15:28:02 write(1, "aa\n", 3) = 3 <0.000112> > 15:28:12 write(1, "aa\n", 3) = 3 <0.000112> > 15:28:22 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< slow > ... > 15:32:02 write(1, "aa\n", 3) = 3 <0.001501> > 15:32:12 write(1, "aa\n", 3) = 3 <0.001440> > 15:32:22 write(1, "aa\n", 3) = 3 <0.000136> <<<<<< fast > 15:32:32 write(1, "aa\n", 3) = 3 <0.000109> > 15:32:42 write(1, "aa\n", 3) = 3 <0.000110> > 15:32:52 write(1, "aa\n", 3) = 3 <0.000112> > 15:33:02 write(1, "aa\n", 3) = 3 <0.000103> > 15:33:12 write(1, "aa\n", 3) = 3 <0.000112> > 15:33:22 write(1, "aa\n", 3) = 3 <0.001405> <<<<<< slow > 15:33:32 write(1, "aa\n", 3) = 3 <0.001393> > 15:33:42 write(1, "aa\n", 3) = 3 <0.001479> > ... > > > > So we have 4 minute long windows of slowness followed by 1 minute window > when writes are fast. > > 15:23:21 - 15:27:22 slow > 15:27:22 - 15:28:22 fast > 15:28:22 - 15:32:22 slow > 15:32:22 - 15:33:22 fast > > > > After some tracing with systemtap and looking at the source code, I > found where the issue is coming from. > The nfs_file_write() function ends up calling nfs_ctx_key_to_expire() on > each write, which in turn calls gss_key_timeout() which has hard-coded > value of 240s (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo). > > > nfs_file_write() > ... > result = nfs_key_timeout_notify(file, inode); > if (result) > return result; > ... > if (nfs_need_check_write(file, inode)) { > int err = nfs_wb_all(inode); ... > > > /* > * Avoid buffered writes when a open context credential's key would > * expire soon. > * > * Returns -EACCES if the key will expire within RPC_KEY_EXPIRE_FAIL. > * > * Return 0 and set a credential flag which triggers the inode to flush > * and performs NFS_FILE_SYNC writes if the key will expired within > * RPC_KEY_EXPIRE_TIMEO. > */ > int > nfs_key_timeout_notify(struct file *filp, struct inode *inode) { > struct nfs_open_context *ctx = nfs_file_open_context(filp); > > if (nfs_ctx_key_to_expire(ctx, inode) && > !ctx->ll_cred) > /* Already expired! */ > return -EACCES; > return 0; > } > > > nfs_need_check_write() > ... > if (nfs_ctx_key_to_expire(ctx, inode)) > return 1; > return 0; > > > > nfs_write_end() > ... > if (nfs_ctx_key_to_expire(ctx, mapping->host)) { > status = nfs_wb_all(mapping->host); ... > > > > /* > * Test if the open context credential key is marked to expire soon. > */ > bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode > *inode) > { > struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth; > struct rpc_cred *cred = ctx->ll_cred; > struct auth_cred acred = { > .cred = ctx->cred, > }; > > if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) { > put_rpccred(cred); > ctx->ll_cred = NULL; > cred = NULL; > } > if (!cred) > cred = auth->au_ops->lookup_cred(auth, &acred, 0); > if (!cred || IS_ERR(cred)) > return true; > ctx->ll_cred = cred; > return !!(cred->cr_ops->crkey_timeout && > cred->cr_ops->crkey_timeout(cred)); > } > > > > net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout = > gss_key_timeout, > > > /* > * Returns -EACCES if GSS context is NULL or will expire within the > * timeout (miliseconds) > */ > static int > gss_key_timeout(struct rpc_cred *rc) > { > struct gss_cred *gss_cred = container_of(rc, struct gss_cred, > gc_base); > struct gss_cl_ctx *ctx; > unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ); > int ret = 0; > > rcu_read_lock(); > ctx = rcu_dereference(gss_cred->gc_ctx); > if (!ctx || time_after(timeout, ctx->gc_expiry)) > ret = -EACCES; > rcu_read_unlock(); > > return ret; > } > > > > > #define GSS_KEY_EXPIRE_TIMEO 240 > static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO; > > > > > > A naïve attempt at a fix: > > > diff --git a/net/sunrpc/auth_gss/auth_gss.c > b/net/sunrpc/auth_gss/auth_gss.c index 25fbd8d9de74..864661bdfdf3 100644 > --- a/net/sunrpc/auth_gss/auth_gss.c > +++ b/net/sunrpc/auth_gss/auth_gss.c > @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc) > > rcu_read_lock(); > ctx = rcu_dereference(gss_cred->gc_ctx); > + if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry)) > + clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags); > if (!ctx || time_after(timeout, ctx->gc_expiry)) > ret = -EACCES; > rcu_read_unlock(); > > > > > With the above patch, if there is a write within 300s before a context > is to expire (use RPCGSSDARGS="-t 400" or any value larger than 300 to > test), it will now try to refresh the context and if successful then > writes will be fast again (assuming -t option is >300s and/or krb ticket > is valid for more than 300s). > > What I haven't tested nor analysed code is what would happen if it now > fails to refresh the context, but after a quick glance at gss_refresh() > it does seem it would continue using the previous cred... > > Is this the correct approach to fix the issue, or can you suggest some > other approach? > > > > -- > Robert Milkowski